Abstract

This thesis investigates timing analysis and optimization issues in synchronous circuitry. The major 
thrust of our work is a collection of provably correct and efficient algorithms that perform a variety of 
architectural-level operations on level-clocked circuitry, that is, circuitry that employs a multiphase 
clocking scheme and level-clocked storage elements. We implemented several of these algorithms in 
Tim, a timing package for two-phase, level-clocked circuitry. Using Tim we empirically compared 
the performance and the storage element requirements of edge-triggered and equivalent level-clocked 
implementations of synchronous circuitry. Our research contributes towards a better understanding 
of the complex timing issues involved in level-clocking and provides the enabling technology for 
bringing level-clocking into the mainstream of circuit design. 

We begin by describing algorithms for optimizing edge-triggered circuitry in Chapter 1. This 
kind of circuitry is particularly popular among designers, because of its intuitive operation and ease 
of implementation. Our work in this area focuses on optimization by retiming, an architectural 
transformation that speeds up circuits by relocating their storage elements. A highlight of our 
research is an $O(V^{1/2} E \lg V)$-time algorithm for retiming unit-delay circuitry for maximum speed.
This is the asymptotically fastest algorithm known to date for the problem. 

In Chapter 2 we move on to investigate timing in a general class of level-clocked circuitry. 
The operation of this circuitry is much more complex and difficult to understand than that of 
edge-triggered circuitry. We first describe polynomial-time algorithms that analyze the timing of 
level-clocked circuitry. Specifically, we give algorithms that verify the proper timing of a circuit by a 
given clocking scheme and analyze the sensitivity of its timing to changes in the propagation delays 
of its components. We also describe a polynomial-time algorithm for clock tuning, an optimization 
that speeds up level-clocked circuits by adjusting the parameters of their clocking schemes. We 
extend retiming to encompass level-clocked circuitry, and we describe polynomial-time algorithms 
that perform retiming or simultaneous retiming and tuning. We also present a polynomial-time 
retiming algorithm that minimizes the number of storage elements in level-clocked circuitry without 
degrading its performance. Major results of our research in this area are an $O(VE + V^2 \lg V)$-time
algorithm for retiming with symmetric clocking schemes and an $O(VE + V^2 \lg V)$-time algorithm
for analyzing the sensitivity of a circuit's timing.

Chapter 3 describes Tim, a versatile and efficient design automation tool for two-phase, level- 
clocked circuitry that is based on our timing analysis and optimization algorithms. Tim has been 
implemented using the C programming language and has been integrated into the SIS tools from 
Berkeley. Our software runs on a workstation under the UNIX environment, and it is available over 
the Internet by ftp. 

We employed Tim to empirically compare edge-triggered and functionally equivalent level- 
clocked implementations of synchronous circuitry in terms of speed and number of storage ele- 
ments. Our results, which are presented in Chapter 4, show that although two-phase, level-clocked 
circuitry has the theoretical potential to operate faster than conventional edge-triggered circuitry, 
edge-triggered circuitry can often perform just as well. Our results do, however, identify the
special circumstances in which level clocking has an advantage. Moreover, they indicate
another advantage of optimized level-clocked designs: they contain substantially fewer
storage elements than edge-triggered designs that operate at the same speed.

Keywords: timing analysis, timing optimization, retiming, computer-aided design, VLSI design, 
digital systems, synchronous circuitry, multiphase circuitry, level-clocked circuitry, graph algorithms, 
combinatorial optimization. 
Introduction



For several years now, VLSI designers have been routinely implementing synchronous dig- 
ital systems with clocked storage elements that are synchronized by the falling edge of a 
single clocking waveform. The operation of this edge-triggered circuitry is quite simple and 
intuitive, and it can be described in just two sentences. When the clocking waveform falls, 
each storage element instantaneously samples its input and asserts that value at its output 
for the remainder of the clock cycle. The circuit operates correctly if the time between 
every two consecutive falling edges is long enough to allow for all signals to propagate along 
every combinational path in the circuit. Due to its simplicity, edge-triggered circuitry has 
become particularly popular among both circuit designers and CAD tool designers. 

An alternative method to implement synchronous systems is to employ a multiphase 
clocking scheme and to use level-clocked latches as storage elements. In this so-called level- 
clocked circuitry, a latch operates like a traffic light. While the clocking waveform is high, 
the latch is transparent and allows data to flow through unimpeded. When the clocking 
waveform falls, however, the latch samples its input signal and asserts it at its output for the 
remainder of the clock cycle. The signals that flow through a latch during its transparent 
phase can initiate the computation of the next combinational stage before the beginning of 
the next clock cycle, a phenomenon known as cycle stealing. Thus, level-clocked circuitry 
has the theoretical potential to operate faster than edge-triggered circuitry, and it is often 
employed in high-performance designs. Unfortunately, cycle stealing perplexes the operation 
of level-clocked circuitry, because data can ripple through several stages of storage elements 
before their propagation is complete. As a result, the design of level-clocked circuitry is 
notoriously difficult, and there are almost no automation tools available to facilitate this 
task. 

In this thesis we present the enabling technology that will bring level-clocking into the 
mainstream of circuit design. Specifically, we describe a collection of provably correct and 
efficient algorithms for analyzing and optimizing the timing of level-clocked circuitry. We 
implemented several of our algorithms in a software package called Tim, and we used our 
tool to empirically compare edge-triggered and functionally equivalent level-clocked designs. 
We believe that the results of our research provide VLSI designers with the tools and the 
intuition required to design level-clocked circuitry with the same degree of confidence, ease, 
and efficiency that is customary for edge-triggered designs. 


Motivation and Related Work 

Since the early years of integrated digital systems, timing analysis and timing optimization
have been identified as two of the most important problems in the design and implementation
of synchronous circuitry. The development of sophisticated timing analysis tools for edge- 
triggered systems had already started in the early 70s, several years before the advent of 
VLSI design [17]. One of the best known programs for analyzing circuitry that employed 
edge-triggered latches was SCALD [34]. 

Level-clocked latches and multiphase clocks became commonplace with the arrival of 
VLSI. The first timing analysis programs that accounted for level-clocked latches were TV 
and CRYSTAL which appeared in the early 80s [23, 40]. These systems, however, were 
hampered by the fact that they could not handle signals that cross phases. Agrawal made 
an important contribution with his timing analysis program which accounted for signals 
that cross phases but not for signals that cross clock cycles [1]. In the mid 80s, Szymanski 
presented the timing analysis program LEADOUT which could handle signals that cross 
phases as well as signals that cross clock cycles [54]. A few years later, Ishii and Leiserson
presented a formal timing analysis for a general class of level-clocked circuitry, and they 
proved that their algorithms run in polynomial time [20]. In their analysis, each block of 
combinational logic was assumed to have a minimum propagation delay equal to zero and 
a maximum propagation delay that was independent of the block's functionality. 

The first procedures for optimizing the timing of level-clocked circuitry by selecting 
appropriate parameters for their clocking waveforms appeared in the late 80s [6, 57]. A 
mathematical framework for timing analysis and optimal selection of clocking parameters 
was presented by Sakallah, Mudge and Olukotun [51]. In this work, each block of combina- 
tional logic was assumed to have a nonzero minimum propagation delay and a nonzero max- 
imum propagation delay that were independent of the block's functionality. The retiming 
optimization, a circuit transformation that has been extensively studied for edge-triggered 
circuitry [28, 29, 31, 45], was investigated in the context of single-phase, level-clocked cir- 
cuitry by Shenoy, Brayton and Sangiovanni-Vincentelli [52]. In all these papers, the authors 
described iterative approximation or successive relaxation schemes to solve the problems. 
The proposed schemes, however, were not guaranteed to run in polynomial time. 

In this thesis we present a rigorous investigation of timing in level-clocked circuitry. 
Assuming the same delay model as the one in [20], we describe provably correct and provably 
efficient algorithms for analyzing and optimizing the timing of a general class of level- 
clocked circuitry that employs multiphase clocking schemes. We believe that our work will 
transform the design of level-clocked circuitry into a substantially easier and more reliable 
task. Our timing analysis algorithms extend beyond simple timing verification by providing 
information about the timing slack of the combinational logic blocks in the circuit. These 
are the first polynomial-time algorithms presented for analyzing the timing sensitivity of 
level-clocked circuitry. Moreover, our timing optimization algorithms do not only select 
the optimal parameters for the clocking waveforms of a level-clocked circuit, but they also 
improve its performance by retiming, that is, by relocating its latches without changing its 
functionality. Retiming has been already investigated in edge-triggered circuitry [28, 29, 31, 
45] and in single-phase, level-clocked circuitry [52]. Our algorithms are the first, however, to
extend retiming to level-clocked circuitry with multiple clocking waveforms. 
Our work extends beyond the borders of theory. We implemented several of our algorithms
in Tim, a timing package for two-phase, level-clocked circuitry. We have used Tim to
empirically compare edge-triggered and functionally equivalent level-clocked designs. Our 
experiments demonstrate specific circuit characteristics that allow level-clocking to achieve 
its speed potential. They also show that level-clocking leads to circuits with fewer stor- 
age elements in high-performance designs. Despite the large amount of work in the area, 
our contribution is the first attempt to empirically quantify the performance differences of 
edge-triggering and two-phase clocking. We believe that our results will prove particularly 
useful to designers of custom integrated circuitry. 

Independently of our work, Lockyear and Ebeling have presented retiming algorithms 
for multiphase, level-clocked circuitry in [32]. The general retiming algorithm in that paper 
has the same computational complexity as the one we present in this thesis. We present 
an asymptotically more efficient algorithm, however, when the clocking waveforms are symmetric.
Some retiming heuristics were also presented in [3], but their correctness and computational
complexity were not analyzed.

Recently, Szymanski and Shenoy presented a provably correct algorithm for timing ver- 
ification that has the same computational complexity as the one we present in this thesis 
and assumes a more general delay model in which the blocks of combinational logic have 
nonzero minimum propagation delays [55]. An efficient algorithm for selecting the param- 
eters of the optimal clocking waveforms in this more general model appeared in [56]. An 
efficient retiming algorithm for this model, however, is not known yet. 



Overview of the Thesis 

This thesis is organized in four chapters. Parts of the material in each chapter have appeared 
in conferences during the last three years. Other parts have been submitted for publication. 

Optimizing Edge-Triggered Circuitry 

In Chapter 1 we investigate algorithms for optimizing edge-triggered circuitry. Early ver- 
sions of this work appeared in [42, 43]. 

An edge-triggered circuit comprises blocks of combinational logic that perform functions 
and edge-triggered latches (also called registers) that implement storage elements. Such a 
circuit is a simple implementation of the semisystolic model of computation that can be 
used to design parallel algorithms. In this chapter, we give tight bounds on the minimum 
clock period that can be achieved by retiming an edge-triggered circuit. Our bounds are 
independent of the size of the circuit; they are expressed in terms of the maximum ratio of 
the total delay over the total register count around any cycle in the circuit graph and in terms 
of the maximum propagation delay $d_{\max}$ of the combinational logic blocks. Moreover, our
bounds characterize exactly the minimum clock period that can be achieved by retiming a 
unit-delay circuit, that is, a circuit in which all combinational logic blocks have propagation 
delay of one unit. 

We also present more efficient algorithms for several important problems related to 
retiming. Specifically, we give an $O(V^{1/2} E \lg V)$-time algorithm for retiming unit-delay
circuits with the minimum possible clock period. This is the asymptotically fastest
algorithm known to date for this problem, and its efficiency stems from our exact
characterization of the minimum clock period for unit-delay circuits. For the general case, in
which circuits also include combinational logic blocks with non-unit delays, we describe an
$O(VE \lg d_{\max})$-time algorithm for retiming with minimum clock period. We also describe an
$O(V^{1/2} E \lg^2(V d_{\max}))$-time algorithm for retiming with a clock period that does not exceed
the optimal by more than an additive factor of $d_{\max} - 1$. Finally, we give an $O(E \lg d_{\max})$-time
algorithm for pipelining combinational circuits with the minimum possible clock period.

Analyzing and Optimizing Level-Clocked Circuitry 

In Chapter 2 we present our algorithms for analyzing and optimizing the timing of level- 
clocked circuitry. Earlier versions of this work appeared in [21, 22, 44, 46, 47, 48]. 

A level-clocked circuit comprises blocks of combinational logic that perform functions 
and level-clocked latches that implement storage elements. In this chapter we describe algo- 
rithms that verify whether a circuit is properly timed by a given clocking scheme and analyze 
the sensitivity of the circuit's timing to changes in the propagation delays of its components. 
We also investigate two strategies for reducing the clock period of a level-clocked circuit: 
clock tuning, which adjusts the waveforms that clock the circuit, and retiming, which relo- 
cates circuit latches. These methods can be used to convert a circuit with edge-triggered 
latches into a faster level-clocked one. 

The algorithms in this chapter are presented in terms of two-phase, level-clocked cir- 
cuitry. At the end of the chapter we extend our algorithms to encompass a general class 
of level-clocked circuitry with multiple phases. We model a two-phase circuit as a graph 
G = (V, E) whose vertex set V is a collection of combinational logic blocks, and whose edge 
set E is a set of interconnections. Each interconnection passes through zero or more latches, 
where each latch is clocked by one of two periodic, nonoverlapping waveforms, or phases. 

We give efficient polynomial-time algorithms for problems involving the timing analysis 
and optimization of two-phase circuitry. Included are algorithms for 

• verification of proper timing: $O(VE)$ time.

• noncritical sensitivity analysis for a single combinational block: $O(VE)$ time.

• noncritical sensitivity analysis for all combinational blocks: $O(VE + V^2 \lg V)$ time.

• critical sensitivity analysis for a single combinational block: $O(VE)$ time.

• minimization of clock period by clock tuning: $O(VE)$ time.

• retiming to achieve a given clock period when the phases are symmetric:
$O(VE + V^2 \lg V)$ time.

• retiming to achieve a given clock period when either the duty cycle (high time) of one
phase or the ratio of the phases' duty cycles is fixed: $O(V^3)$ time.

By characterizing the set of possible clock periods under any retiming of the circuit, we 
are able to obtain polynomial-time algorithms for clock period minimization by: 

• retiming and tuning when the duty cycles of the two phases are required to be equal:
$O(V^2 E)$ time.

• retiming and tuning when either the duty cycle of one phase is fixed or the ratio of
the phases' duty cycles is fixed: $O(V^2 E + V^3 \lg V)$ time.

• simultaneous retiming and clock tuning with no conditions on the duty cycles of the
two phases: $O(V^{11})$ time.

Unfortunately, this last algorithm is not practical. For these problems, however, we present 
fully polynomial-time approximation schemes for clock period minimization within any 
given relative error $\epsilon > 0$. Specifically, we give an

• $O((VE + V^2 \lg V) \lg(V/\epsilon))$-time algorithm for retiming and tuning when the duty
cycles of the two phases are required to be equal.

• $O(V^3 \lg(V/\epsilon))$-time algorithm for retiming and tuning when either the duty cycle of
one phase is fixed or the ratio of the phases' duty cycles is fixed.

• $O(V^3 (1/\epsilon) \lg(1/\epsilon) + (VE + V^2 \lg V) \lg(V/\epsilon))$-time algorithm for simultaneous retiming
and clock tuning with no conditions on the duty cycles of the two phases.

The first two of these approximation algorithms can be used to obtain the optimum clock 
period in the special case where all propagation delays are integers. 

At the end of this chapter, we generalize most of the results for two-phase clocking 
schemes to simple multiphase clocking disciplines, including ones with overlapping phases. 
Typically, the algorithms to verify and optimize the timing of $k$-phase circuitry are at most
a factor of k slower than the corresponding algorithms for two-phase circuitry. Sensitivity 
analysis for all combinational blocks and retiming with symmetric phases, however, can still 
be performed in asymptotically the same number of steps as for two-phase circuitry. 

Tim: A Timing Package for Level-Clocked Circuitry

In Chapter 3 we describe Tim, a versatile and efficient tool for verifying and optimizing 
the timing of two-phase, level-clocked circuitry. An earlier version of this work appeared in 
[48]. 

Tim is based on the algorithms that we present in Chapter 2 and performs a wide variety 
of functions such as timing verification, sensitivity analysis, clock tuning, retiming and clock 
tuning for maximum speed of operation, and retiming for minimum number of latches. In 
Tim we have extended our algorithms to handle nonideal latches. All latches are assumed 
to have equal propagation delays, equal setup times and equal hold times. Moreover, the 
implementation of our retiming algorithms does not relocate the input/output latches, thus 
preserving the input/output phases and the total latency of the circuit. 

The entire software package has been developed using the C programming language 
in a UNIX environment. The system has been integrated in the SIS tools from Berkeley, 
and it is available over the Internet by ftp. On a SPARCstation 2 with 64MB of main
memory, each of Tim's timing analysis functions requires a couple of minutes for a circuit 
with approximately 1,500 gates. The retiming functions are slower, however, and they 
require approximately 35 minutes for a circuit of the same size. Tim's retiming algorithms 
operate on a dense graph representation of the problem. Almost half of the time required 
by its retiming operations is spent on constructing this graph. We believe that the practical
performance of some of the retiming algorithms will be substantially improved by adjusting 
them to operate on a sparse graph representation. 

Edge-Triggering vs. Level-Clocking 

In Chapter 4 we present an empirical comparison of edge-triggered and two-phase, level-clocked
circuitry in terms of speed and storage element requirements. An earlier version
of this work appeared in [47]. 

Level-clocked circuitry that employs a two-phase, nonoverlapping clocking scheme has 
the theoretical potential to operate up to twice as fast as edge-triggered circuitry. Using 
Tim, we have run experiments that demonstrate, however, that edge-triggering is often just 
as fast as two-phase clocking, and that the speed potential of two-phase clocking is generally 
not obtained except when the delay between any two consecutive latches is roughly uniform 
and close to the maximum combinational block delay. Moreover, our experiments show 
that asymmetrical clocking of a two-phase circuit often does not provide any speedup over 
optimal symmetric clocking schemes. 

Level-clocking can lead to substantial reductions in the number of storage elements in 
a circuit, however. Our experiments show that for edge-triggered circuitry that has been 
retimed to operate with the minimum possible clock period, by replacing each edge-triggered 
latch by a pair of level-clocked latches and subsequently retiming the resulting two-phase 
circuit, the number of storage elements can be reduced by up to 38% without increasing 
the clock period of the final design or affecting its input/output specification. Reductions 
of greater than 25% were achieved for more than one third of the circuits we tested. 

We ran our tests on MCNC benchmark circuits, AT&T communication circuits, and 
custom circuitry designed for MIT's Alewife machine. 



Chapter 1 

Optimizing Edge-Triggered Circuitry

1.1 Introduction 

This chapter describes algorithms for optimizing edge-triggered circuitry, that is, syn- 
chronous circuits built of functional elements and globally clocked registers. Retiming, 
which was introduced in [27, 28, 29] and treated in [31], is a well-known design automation 
technique which transforms a given edge-triggered circuit into a faster circuit, that is, one 
with a shorter clock period, by relocating the registers of the given circuit while preserv- 
ing its functionality. In this chapter we further investigate retiming and provide results of 
theoretical as well as practical interest. Specifically, we give tight bounds on the minimum 
clock period that can be achieved by retiming a circuit in terms of the maximum delay-to- 
register ratio of the cycles in the circuit graph and the maximum propagation delay of the 
combinational logic blocks in the circuit. These bounds do not depend on the size of the 
circuit and characterize exactly the minimum clock period that can be achieved by retiming 
a unit-delay circuit. We exploit these bounds to obtain asymptotically improved algorithms 
for several important problems related to retiming. A highlight of the research presented in 
this chapter is the asymptotically fastest algorithm known to date for retiming unit-delay 
circuitry. 

We model a synchronous circuit according to [27, 28, 29] by a circuit graph $G = (V,E,d,w)$.
A vertex $v \in V$ corresponds to a functional element of the circuit, and an
edge $u \to v \in E$ corresponds to a wire between the functional elements $u$ and $v$. The integer
edge-weight $w(e)$ is the number of registers on the directed edge $e$. The vertex-weight
$d(v)$ is the propagation delay of the signals through the functional element $v$. For simplicity,
and without any loss of generality, we assume that vertex-weights are also integers. This
assumption is justified by the fact that digital computers represent data with only a finite
number of bits. Figure 1-1(a) illustrates a synchronous circuit with unit-delay functional
elements.

Intuitively, the circuit operates as follows. Between any two consecutive clock ticks,
signals propagate along wires and ripple concurrently through the functional elements. By
the end of a clock period all signals must have settled in the registers of the circuit. Although
the functional elements of the circuit operate in parallel, some signals may require more
time to settle than others, because they experience longer propagation delays along their
paths. The clock period $\Phi(G)$ of the system is defined naturally as the propagation delay of
the longest register-free path in the circuit, which is well-defined for synchronous circuits
in which every directed cycle contains at least one register. For example, the clock period
of the unit-delay circuit in Figure 1-1(a) is 4 units of time (path FABG).

Figure 1-1: (a) A synchronous circuit G with unit-delay functional elements. The vertices represent
functional elements, the edges represent wires, and the rectangles represent registers. The integers
next to the edges indicate number of registers. The clock period of the circuit is 4 units of time
(path FABG). (b) A retiming of G. The integer assignment is indicated next to the vertices. The
clock period of this circuit is 2 units of time.

A retiming $r$ of $G$ is an assignment $r : V \to \mathbb{Z}$ such that $w(e) + r(u) - r(v) \ge 0$
for every edge $e = u \to v \in E$. Given $r$, we transform the circuit by removing $r(v)$ registers
from every edge coming into $v$, and adding $r(v)$ registers to every edge going out of $v$. This
results in a retimed circuit $G_r = (V,E,d,w_r)$, with $w_r(e) = w(e) + r(u) - r(v) \ge 0$ for every
edge $e = u \to v \in E$. In Figure 1-1(b) we have retimed the circuit of Figure 1-1(a) so that the
resulting circuit has clock period 2 units of time. Note that the total number of registers
around any cycle in the circuit remains invariant after retiming.
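
The definitions above can be exercised on a small example. The following is a minimal sketch, not the thesis's implementation: the ring circuit, the edge-list representation, and the function names `retime` and `clock_period` are assumptions made for illustration. It applies the rule w_r(e) = w(e) + r(u) - r(v) and measures the clock period as the delay of the longest register-free path.

```python
from collections import defaultdict, deque


def retime(edges, r):
    """Apply w_r(e) = w(e) + r(u) - r(v) to every edge (u, v, w)."""
    retimed = []
    for u, v, w in edges:
        w_r = w + r[u] - r[v]
        if w_r < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} gets {w_r} registers")
        retimed.append((u, v, w_r))
    return retimed


def clock_period(delay, edges):
    """Delay of the longest register-free path (only edges with w == 0)."""
    succ = defaultdict(list)
    indegree = {v: 0 for v in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indegree[v] += 1
    # Longest path over the register-free subgraph in topological order.
    arrival = dict(delay)
    queue = deque(v for v in delay if indegree[v] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if visited != len(delay):
        raise ValueError("circuit has a register-free cycle")
    return max(arrival.values())


# Hypothetical unit-delay ring A -> B -> C -> D -> A with two registers on D -> A.
delay = {"A": 1, "B": 1, "C": 1, "D": 1}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "D", 0), ("D", "A", 2)]
r = {"A": 0, "B": 0, "C": -1, "D": -1}

retimed = retime(edges, r)
print(clock_period(delay, edges))    # 4: register-free path A, B, C, D
print(clock_period(delay, retimed))  # 2: register-free paths A, B and C, D
```

In this example the retiming moves one register from edge D -> A to edge B -> C, so the register count around the ring stays at two while the longest register-free path shrinks.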

The delay-to-register ratio of a directed cycle in a circuit is defined as the ratio of the 
total propagation delay around the cycle over the total number of registers in the cycle. For 
example, the delay-to-register ratio of the directed cycle ABCDEF in Figure 1-1 is 6/6 = 1. 
Observe that the delay-to-register ratio of any cycle is the same in the original circuit G 
and in the retimed circuit $G_r$, since both the total delay and the number of registers around
any cycle remain invariant after retiming. This observation suggests a relation between the 
minimum clock period that we can achieve by retiming G and the maximum delay-to-register 
ratio in G. Let us illustrate this relation by means of our example circuit in Figure 1-1. 
Consider the cycle ABGF with delay-to-register ratio 4/2 = 2, which is the maximum among 
the three directed cycles in G. It is not possible to distribute the registers around ABGF 
in a way that achieves a clock period shorter than the average delay per register in ABGF, 
since the delay-to-register ratio around any cycle in $G_r$ remains invariant. Therefore, $G$
cannot be retimed to achieve a clock period smaller than 2, the delay-to-register ratio of
ABGF, and the circuit in Figure 1-1(b) has achieved its minimum possible clock period.
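
To make the cycle argument concrete, the sketch below computes the maximum delay-to-register ratio of a small circuit by brute-force enumeration of its simple directed cycles. This is only an illustration under stated assumptions: the example circuit and the function name `max_delay_to_register_ratio` are hypothetical, and exhaustive enumeration is exponential in general, unlike the efficient algorithms this chapter develops.

```python
import math
from fractions import Fraction


def max_delay_to_register_ratio(delay, edges):
    """Largest (total delay) / (total registers) over all simple directed cycles.

    Returns None for an acyclic circuit. Brute force: suitable only for tiny graphs.
    """
    order = {v: i for i, v in enumerate(delay)}
    succ = {v: [] for v in delay}
    for u, v, w in edges:
        succ[u].append((v, w))
    best = None

    def dfs(start, u, total_delay, total_regs, on_path):
        nonlocal best
        for v, w in succ[u]:
            if v == start:
                regs = total_regs + w
                if regs == 0:
                    raise ValueError("circuit has a register-free cycle")
                ratio = Fraction(total_delay, regs)
                if best is None or ratio > best:
                    best = ratio
            elif order[v] > order[start] and v not in on_path:
                # Each cycle is enumerated once, from its lowest-ordered vertex.
                on_path.add(v)
                dfs(start, v, total_delay + delay[v], total_regs + w, on_path)
                on_path.remove(v)

    for s in delay:
        dfs(s, s, delay[s], 0, {s})
    return best


# Hypothetical unit-delay circuit with two cycles: A,B,C,D (ratio 4/2) and A,B,C (ratio 3/2).
delay = {"A": 1, "B": 1, "C": 1, "D": 1}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "D", 0), ("D", "A", 2), ("C", "A", 2)]

ratio = max_delay_to_register_ratio(delay, edges)
print(ratio, math.ceil(ratio))  # 2 2
```

Exact rational arithmetic (`Fraction`) is used so that rounding the ratio up to an integer lower bound is not affected by floating-point error.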

In this chapter we prove that rounding up the maximum delay-to-register ratio in any 
circuit yields a lower bound on the minimum clock period that we can achieve by retiming
the circuit. Moreover, we show that there always exists a retiming that comes within an
additive factor of $d_{\max} - 1$ of the lower bound, where $d_{\max}$ denotes the maximum
among the propagation delays of the functional elements in the circuit. As a special case of 
these general bounds we have that the maximum delay-to-register ratio in any unit-delay 
circuit characterizes exactly the minimum clock period achievable by retiming. (The result 
for the special case of unit-delay circuitry has been claimed independently in [8].) Our 
tight bounds yield asymptotically more efficient algorithms for several important problems 
related to retiming, such as minimum clock-period retiming, retiming for approximately 
minimum clock-period, and minimum clock-period pipelining. 
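
The two bounds described in this paragraph can be summarized in one line. The notation is a sketch under assumptions: $\rho_{\max}(G)$ is shorthand introduced here for the maximum delay-to-register ratio over the directed cycles of $G$ (the symbol is not taken from the source), $\Phi_{\min}(G)$ is the minimum clock period achievable by retiming, and $d_{\max}$ is the maximum propagation delay of a functional element.

```latex
\[
  \lceil \rho_{\max}(G) \rceil \;\le\; \Phi_{\min}(G) \;\le\; \lceil \rho_{\max}(G) \rceil + d_{\max} - 1 .
\]
% Unit-delay special case: d_max = 1 makes the two bounds coincide, so
\[
  \Phi_{\min}(G) = \lceil \rho_{\max}(G) \rceil .
\]
```

For the circuit of Figure 1-1, the maximum ratio is 2 (cycle ABGF) and every delay is one unit, so both bounds equal 2, which matches the clock period of the retimed circuit in Figure 1-1(b).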

The remainder of this chapter is structured as follows. In Section 1.2 we give some back- 
ground material on retiming and its relation with the single-source shortest-paths problem. 
In Section 1.3 we state and prove the tight bounds on the minimum clock period $\Phi_{\min}(G)$
that can be achieved by retiming a circuit G. 

In Section 1.4, we give an $O(V^{1/2} E \lg V)$-time algorithm that retimes any unit-delay
circuit to achieve the minimum possible clock period $\Phi_{\min}(G)$. This result improves the
$O(VE \lg V)$ bound from [31]. Our algorithm is based on the exact characterization of
$\Phi_{\min}(G)$ as well as on scaling algorithms for finding single-source shortest-paths and the
minimum cycle-mean in a graph [12, 39].

In Section 1.5, we present algorithms for the general case, in which circuits include 
combinational logic blocks of non-unit delay. Specifically, Subsection 1.5.1 describes an 
$O(VE \lg d_{\max})$-time algorithm for minimum clock period retiming. This algorithm performs
a preprocessing step for computing the maximum delay-to-register ratio in the circuit,
followed by a binary search of $d_{\max}$ possible clock periods. Assuming that the maximum
propagation delay $d_{\max}$ of the circuit components grows subpolynomially with the size
of the circuit, our algorithm is asymptotically more efficient than the previously known
schemes [31]. Subsection 1.5.2 presents an $O(V^{1/2} E \lg^2(V d_{\max}))$-time procedure for retiming
a circuit with a clock period that does not exceed the minimum by more than an additive
factor of $d_{\max} - 1$.
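
The binary search mentioned above can be sketched as follows. This is an illustration under assumptions, not the algorithm of Subsection 1.5.1: `feasible` is a hypothetical oracle that reports whether some retiming achieves a given clock period, and `lower` stands for the rounded-up maximum delay-to-register ratio. Because the bounds confine the optimum to a window of $d_{\max}$ consecutive integers, about $\lg d_{\max}$ oracle calls suffice.

```python
def min_feasible_period(lower, d_max, feasible):
    """Smallest integer c in [lower, lower + d_max - 1] with feasible(c) true.

    Assumes feasibility is monotone in c and that the top of the window is
    feasible, as the bounds on the minimum clock period guarantee.
    Returns the period together with the number of oracle calls made.
    """
    lo, hi = lower, lower + d_max - 1
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if feasible(mid):
            hi = mid          # mid works, so the optimum is at or below mid
        else:
            lo = mid + 1      # mid fails, so the optimum is above mid
    return lo, probes


# Stand-in oracle for illustration only: pretend periods of 9 or more are achievable.
period, probes = min_feasible_period(lower=5, d_max=8, feasible=lambda c: c >= 9)
print(period, probes)  # 9 3
```

With unit delays the window contains a single value, so the loop makes no oracle calls and the lower bound itself is returned.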

In Section 1.6 we extend our characterization of the minimum clock period to encompass 
combinational circuits, that is, circuits with no directed cycles in their graphs. We show how 
to pipeline any combinational circuit in $O(E \lg d_{\max})$ steps in order to achieve a specified
latency with the minimum possible clock period. This result improves the $O(VE \lg V)$
bound that can be obtained by applying the general algorithms from [31], and it is optimal 
within a constant multiplicative factor for circuitry with unit-delay functional elements. 



1.2 Retiming and Shortest Paths 

In this section we define some notation and terminology needed in this chapter. We for- 
mulate retiming according to [31] as a set of difference constraints, and we introduce the 
notion of the constraint graph. Finally, we exhibit the relation between retiming and the 
existence of single-source shortest-paths in the constraint graph. 

Given a circuit graph $G = (V,E,d,w)$, we define the path weight $w(p)$ for any path
$p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$ in the circuit as the sum of the edge-weights of the path:

$$w(p) = \sum_{i=0}^{k-1} w(e_i).$$

We also define the path delay $d(p)$ as the sum of the delays of the vertices in the path:

$$d(p) = \sum_{i=0}^{k} d(v_i).$$

A retiming of a circuit $G$ is an integer-valued vertex-labeling $r : V \to \mathbb{Z}$.
The retiming $r$ specifies a transformation of the original circuit in which registers are added
and removed so as to change the original circuit $G$ into a new circuit $G_r = (V,E,d,w_r)$
with clock period $\Phi(G_r)$. The edge-weighting $w_r$ is defined for an edge
$u \xrightarrow{e} v$ of $G_r$ by the equation

\[ w_r(e) = w(e) + r(u) - r(v). \]
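
As an illustration of the retiming equation, the following minimal Python sketch recomputes the register counts of a small circuit. The dictionary-based graph representation and the function name `apply_retiming` are assumptions made for this example and are not part of the original development.

```python
def apply_retiming(edges, r):
    """Return the retimed register counts w_r(e) = w(e) + r(u) - r(v).

    `edges` maps an edge (u, v) to its register count w(e); `r` maps each
    vertex to its integer label. Raises ValueError if an edge would end up
    with a negative number of registers (an illegal retiming).
    """
    retimed = {}
    for (u, v), w in edges.items():
        w_r = w + r[u] - r[v]
        if w_r < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} gets {w_r} registers")
        retimed[(u, v)] = w_r
    return retimed


# A three-vertex ring with two registers on the edge a -> b.
ring = {("a", "b"): 2, ("b", "c"): 0, ("c", "a"): 0}
labels = {"a": 0, "b": 1, "c": 1}
print(apply_retiming(ring, labels))
```

In this example one register moves from the edge a -> b to the edge c -> a, and the total register count around the cycle is unchanged, as the telescoping sum of the labels requires.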

The following theorem characterizes the conditions under which we can find a retiming that 
produces a circuit with clock period no greater than a given constant. 

Theorem 1 ([31]) Let $G = (V,E,d,w)$ be a synchronous circuit, let $c$ be an arbitrary
positive real number, and let $r$ be a function from $V$ to the integers. Then, $r$ is a legal
retiming of $G$ such that $\Phi(G_r) \le c$ if and only if

\[ r(v) - r(u) \le w(e) \tag{1.1} \]

for every edge $u \xrightarrow{e} v$ in $G$, and

\[ r(v) - r(u) \le W(u,v) - 1 \tag{1.2} \]

for all vertices $u, v \in V$ such that $D(u,v) > c$, where

\[ W(u,v) = \min\{ w(p) : u \stackrel{p}{\leadsto} v \}, \]

\[ D(u,v) = \max\{ d(p) : u \stackrel{p}{\leadsto} v \text{ and } w(p) = W(u,v) \}. \]

Inequality (1.1) guarantees that the number of registers on every edge $u \xrightarrow{e} v$ of the
retimed circuit $G_r$ is nonnegative. Inequality (1.2) enforces the clock period constraint: every
simple path $u \stackrel{p}{\leadsto} v$ with delay $d(p) > c$ will have at least one register in
the retimed circuit $G_r$. There are potentially $O(V^2)$ inequalities of the form (1.2), one for
each pair of vertices in $G$, and they can be computed in $O(VE + V^2 \lg V)$ steps [31].
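
The quantities $W(u,v)$ and $D(u,v)$ can be illustrated with a small sketch. The code below is not the $O(VE + V^2 \lg V)$ procedure of [31]; it swaps in a plain Floyd-Warshall pass over lexicographic pairs (registers, negated delay), which takes cubic time. It assumes that every cycle carries at least one register, and the function name `wd_matrices` is hypothetical.

```python
from itertools import product


def wd_matrices(vertices, edges, delay):
    """Compute W(u, v) and D(u, v) for every pair joined by a path.

    `edges` maps (u, v) to w(e); `delay` maps v to d(v). Each pair is scored
    lexicographically by (registers, -delay), so the minimum score gives the
    fewest registers and, among such paths, the largest delay.
    """
    inf = float("inf")
    best = {(u, v): (inf, 0) for u, v in product(vertices, repeat=2)}
    for v in vertices:
        best[(v, v)] = (0, -delay[v])            # the empty path at v
    for (u, v), w in edges.items():
        best[(u, v)] = min(best[(u, v)], (w, -(delay[u] + delay[v])))
    for k, i, j in product(vertices, repeat=3):  # k varies slowest
        a, b = best[(i, k)], best[(k, j)]
        if a[0] == inf or b[0] == inf:
            continue
        # d(k) is counted in both halves, so add it back once.
        cand = (a[0] + b[0], a[1] + b[1] + delay[k])
        best[(i, j)] = min(best[(i, j)], cand)
    W = {p: s[0] for p, s in best.items() if s[0] != inf}
    D = {p: -s[1] for p, s in best.items() if s[0] != inf}
    return W, D


verts = ["a", "b", "c"]
regs = {("a", "b"): 2, ("b", "c"): 0, ("c", "a"): 0}
delays = {"a": 1, "b": 2, "c": 3}
W, D = wd_matrices(verts, regs, delays)
print(W[("b", "a")], D[("b", "a")])
```

For the pair (b, a) the only register-free route is b -> c -> a, so W is 0 and D is the sum of the three vertex delays on that route.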

The constraints (1.1) and (1.2) in Theorem 1 are linear inequalities involving only dif- 
ferences of the unknowns r(v). Therefore, the retiming problem can be expressed in the 
following general form. 

Problem DC (Difference Constraints) Let $L$ be a set of $m$ linear constraints of the form

\[ x(v) - x(u) \le t(u,v) \tag{L} \]

on the $n$ unknowns $x(1), x(2), \ldots, x(n)$, where $t(u,v)$ are given integer constants. Determine
a set of integer feasible values for the unknowns $x(u)$ or determine that no such set exists. □

The following theorem is classic in the field of combinatorial optimization [25, 41], and 
provides a method for solving Problem DC. 

Theorem 2 Let $L$ be a set of difference constraints, and let $G = (V, E, t)$ be the directed,
edge-weighted, constraint graph constructed from $L$ in the following way:

\[ V = \{u : x(u) \text{ is an unknown of } L\}; \]

\[ E = \{u \to v : x(v) - x(u) \le t(u,v) \text{ is a constraint in } L\}; \]

\[ t(u \to v) = t(u,v), \text{ for every edge } u \to v \text{ in } E. \]

Then, Problem DC is feasible if and only if there exists no directed cycle $C$ in $G$ with
edge-weight $t(C) < 0$. Moreover, let every vertex $v \in V$ be reachable from a vertex $s \in V$ by a
path $s \leadsto v$ in $G$. If there exists a solution $r$ to the shortest-paths problem in $G$ from the
source $s$, that is, if $r(v) = \min\{t(p) : s \stackrel{p}{\leadsto} v \in G\}$ for every vertex
$v \in V$, then $r$ is also a solution for the constraint set $L$, such that $x(v) - x(s)$ is
maximized for every vertex $v \in V$.

We denote by $G_c = (V, E_c, w_c)$ the constraint graph that corresponds to Inequalities (1.1)
and (1.2) for a given $c$. Theorem 2 implies that a retimed circuit with clock period no
greater than $c$ can be computed in $O(V^3)$ steps by applying the $O(VE)$-time shortest-paths
algorithm by Bellman and Ford [25, page 74] on the dense constraint graph $G_c$, which can have
$O(V^2)$ edges. An asymptotically faster algorithm, which runs in $O(VE)$ time, appears in [31].
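
The method of Theorem 2 can be sketched in a few lines of Python. The version below is a plain Bellman-Ford relaxation under stated assumptions: a virtual source with a zero-weight edge to every unknown is modelled by starting all distances at 0, and the name `solve_difference_constraints` is hypothetical.

```python
def solve_difference_constraints(unknowns, constraints):
    """Solve x(v) - x(u) <= t for every triple (u, v, t) in `constraints`.

    Returns a dict of integer values, or None when the constraint graph has
    a negative-weight cycle and the system is therefore infeasible.
    """
    if not unknowns:
        return {}
    dist = {v: 0 for v in unknowns}        # distances from the virtual source
    for _ in range(len(unknowns)):
        changed = False
        for u, v, t in constraints:        # relax the edge u -> v of weight t
            if dist[u] + t < dist[v]:
                dist[v] = dist[u] + t
                changed = True
        if not changed:
            return dist
    return None                            # still relaxing: negative cycle


feasible = [("a", "b", 1), ("b", "c", -2), ("c", "a", 1)]
infeasible = [("a", "b", 1), ("b", "c", -2), ("c", "a", 0)]
print(solve_difference_constraints(["a", "b", "c"], feasible))
print(solve_difference_constraints(["a", "b", "c"], infeasible))
```

The first system corresponds to a cycle of total weight 0 and has a solution; the second corresponds to a cycle of weight -1 and has none.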

1.3 Characterization of Minimum Clock Period $\Phi_{\min}(G)$

1.3.1 Bounds on $\Phi_{\min}(G)$

In this section we characterize the minimum clock period $\Phi_{\min}(G)$ that can be obtained by
retiming a given circuit $G = (V, E, d, w)$ in terms of the maximum delay-to-register ratio
of the cycles in the circuit graph $G$ and the maximum propagation delay of the circuit
components.

First, we give some definitions that will allow us to state and prove our results formally. 
We define the delay-to-register ratio $R(C)$ of a cycle
$C = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-2}} v_{k-1} \xrightarrow{e_{k-1}} v_0$
in the circuit $G$ as follows:

\[ R(C) = \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w(e)}. \]

We denote by $C^*(G)$ the directed cycle in $G$ with maximum delay-to-register ratio. By
definition, $R(C^*(G)) \ge R(C)$ for every cycle $C$ in $G$. Finally, we denote by $\Phi_{\min}(G)$ the
smallest possible clock period that we can achieve by retiming $G$:

\[ \Phi_{\min}(G) = \min\{\Phi(G_r) : r \text{ is a retiming of } G\}. \]
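
For concreteness, the ratio $R(C)$ of a single cycle can be evaluated exactly with rational arithmetic. The following sketch assumes a cycle given as a list of vertices and dictionary inputs; the name `delay_to_register_ratio` is chosen only for this example.

```python
from fractions import Fraction


def delay_to_register_ratio(cycle, delay, weight):
    """R(C) for a cycle given as the vertex list [v0, v1, ..., v_{k-1}].

    `delay` maps a vertex to d(v); `weight` maps an edge (u, v) to w(e).
    """
    k = len(cycle)
    total_delay = sum(delay[v] for v in cycle)
    total_regs = sum(weight[(cycle[i], cycle[(i + 1) % k])] for i in range(k))
    if total_regs == 0:
        raise ValueError("a synchronous circuit has a register on every cycle")
    return Fraction(total_delay, total_regs)


delays = {"a": 1, "b": 2, "c": 3}
regs = {("a", "b"): 2, ("b", "c"): 0, ("c", "a"): 0}
print(delay_to_register_ratio(["a", "b", "c"], delays, regs))
```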

Our first theorem bounds the range of the minimum clock period of a circuit.

Figure 1-2: Path $p^i$ used in the proof of the lower bound. Note that only the first and the last
edge in the path have non-zero register count.

Theorem 3 Let $G = (V,E,d,w)$ be a synchronous circuit with maximum delay-to-register
ratio $R(C^*(G))$, and let $\Phi_{\min}(G)$ be the minimum clock period we can obtain by retiming $G$.
Then

\[ \lceil R(C^*(G)) \rceil \le \Phi_{\min}(G) \le \lceil R(C^*(G)) \rceil + d_{\max} - 1, \]

where $d_{\max} = \max\{d(v) : v \in V\}$.

The proofs of the lower and the upper bound are given in Sections 1.3.2 and 1.3.3 respec- 
tively. Observe that both the upper and the lower bound are independent of the number of 
vertices and the number of edges in the circuit. 

For unit-delay circuits, the bounds in Theorem 3 yield an exact characterization of the 
minimum clock period. 

Corollary 4 Let $G = (V, E, 1, w)$ be a unit-delay synchronous circuit with maximum
delay-to-register ratio $R(C^*(G))$, and let $\Phi_{\min}(G)$ be the minimum clock period we can
obtain by retiming $G$. Then

\[ \lceil R(C^*(G)) \rceil = \Phi_{\min}(G). \]

Proof. Follows directly from Theorem 3 for $d_{\max} = 1$. □

As we shall see in Section 1.4, this property of unit-delay circuits allows us to derive asymp- 
totically more efficient schemes for their optimization. 
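
The interval given by Theorem 3 is easy to evaluate once the maximum ratio is known. The helper below is a small sketch with a hypothetical name, `clock_period_bounds`; it takes the ratio as an exact fraction so that the ceiling is not affected by rounding.

```python
from fractions import Fraction
from math import ceil


def clock_period_bounds(max_ratio, d_max):
    """Return (lower, upper) bounds on the minimum clock period (Theorem 3)."""
    lower = ceil(max_ratio)
    return lower, lower + d_max - 1


print(clock_period_bounds(Fraction(7, 2), 3))   # general delays
print(clock_period_bounds(Fraction(5, 2), 1))   # unit delays: bounds coincide
```

With unit delays the two bounds coincide, which is the content of Corollary 4.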

1.3.2 Lower Bound 

In this section we prove the lower bound of Theorem 3. Specifically, we prove the following 
lemma. 

Lemma 5 Let $G = (V, E, d, w)$ be a synchronous circuit with maximum delay-to-register
ratio $R(C^*(G))$, and let $\Phi_{\min}(G)$ be the minimum clock period we can obtain by retiming $G$.
Then

\[ \lceil R(C^*(G)) \rceil \le \Phi_{\min}(G). \]

Proof. Assume that we have retimed $G$ in such a way that it achieves the minimum possible
clock period $\Phi_{\min}(G)$. Let $C^*$ be the cycle with maximum delay-to-register ratio in $G$ (see
Figure 1-2), and let $N = \{e \in C^* : w_r(e) > 0\}$. Now, consider a path
$p^i = v^i_0 \xrightarrow{e^i_0} v^i_1 \xrightarrow{e^i_1} \cdots \xrightarrow{e^i_{l-2}} v^i_{l-1} \xrightarrow{e^i_{l-1}} v^i_l$
in $C^*$, such that $e^i_0, e^i_{l-1} \in N$ and $e^i_j \notin N$, for $j = 1, 2, \ldots, l-2$. Observe
that only the first and the last edge in the path $p^i$ have registers on them. Now, by definition
of the clock period, the register-free part of $p^i$ satisfies

\[ \sum_{j=1}^{l-1} d(v^i_j) \le \Phi_{\min}(G). \]

There are $|N|$ paths around $C^*$ that have the form of $p^i$. By summing up the $|N|$
corresponding inequalities for $\Phi_{\min}(G)$, we obtain

\[ \sum_{v \in C^*} d(v) \;\le\; \Phi_{\min}(G) \cdot |N| \;\le\; \Phi_{\min}(G) \cdot \sum_{e \in C^*} w_r(e) \;=\; \Phi_{\min}(G) \cdot \sum_{e \in C^*} w(e), \]

since $w_r(e) = w(e) + r(u) - r(v)$ for every edge $u \xrightarrow{e} v$, and the sum
$\sum_{e \in C^*} w_r(e)$ telescopes. Since the propagation delays are integers, $\Phi_{\min}(G)$ must
also be an integer, and therefore

\[ \left\lceil \frac{\sum_{v \in C^*} d(v)}{\sum_{e \in C^*} w(e)} \right\rceil \le \Phi_{\min}(G). \]

The lemma follows directly from the definition of $R(C^*(G))$. □

1.3.3 Upper Bound 

In this section we prove the upper bound of Theorem 3. Our proof uses properties of the 
constraint graph G c that was introduced in Section 1.2, and Lemma 6, whose correctness 
follows directly from Theorem 2 and the definition of the constraint graph. 

Lemma 6 Given a circuit $G = (V,E,d,w)$ and a real number $c$, there exists a retiming $r$
of $G$ such that $\Phi(G_r) \le c$ if and only if the constraint graph $G_c$ has no negative
edge-weight cycles. □

We proceed with the proof of the upper bound.

Lemma 7 Let $G = (V, E, d, w)$ be a synchronous circuit with maximum delay-to-register
ratio $R(C^*(G))$, and let $\Phi_{\min}(G)$ be the minimum clock period we can obtain by retiming $G$.
Then

\[ \Phi_{\min}(G) \le \lceil R(C^*(G)) \rceil + d_{\max} - 1. \]

Figure 1-3: (a) The cycle $C^-$ in $G_c$. The solid lines indicate edges in $S$. For simplicity, we
have not indicated the registers on these edges. The broken lines indicate edges in $S'$. (b) After
the introduction of the edges in $S''$ that correspond to the edges in $S'$, the solid cycle has edges
exclusively in $G$ and its delay-to-register ratio is greater than $R(C^*(G))$.



Proof. The proof is by contradiction of the fact that $R(C^*(G))$ is the maximum
delay-to-register ratio in $G$. Let us assume that there does not exist a retiming $r$ such that
$\Phi(G_r) \le \lceil R(C^*(G)) \rceil + d_{\max} - 1$. Equivalently, according to Lemma 6, we assume
that for $c = \lceil R(C^*(G)) \rceil + d_{\max} - 1$ there exists a negative edge-weight cycle $C^-$ in
the constraint graph $G_c = (V, E \cup E', w_c)$ (see Figure 1-3(a)), where $E'$ denotes the edges of
$G_c$ introduced due to Inequality (1.2). We can partition the edges of $C^-$ into $C^- = S \cup S'$,
where $S \subseteq E$ and $S' \subseteq E'$. Since the edge-weights $w_c(e)$ are integral, we have

\[ \sum_{e \in S} w_c(e) + \sum_{e \in S'} w_c(e) \le -1. \tag{1.3} \]

Now, Inequality (1.2) implies that every edge $u \xrightarrow{e} v$ in $E'$ with weight
$w_c(e) = W(u,v) - 1$ corresponds to a path $u \stackrel{p}{\leadsto} v$ in $E$ with weight
$w(p) = W(u,v)$ and delay $d(p) > c$. Let
$S'' = \{v_1 \to v_2 : u \to v \in S' \text{ corresponds to path } p \in G,\ v_1 \to v_2 \in p\}$
(see Figure 1-3(b)). Then, using Inequality (1.3) we obtain

\[
\begin{aligned}
\sum_{e \in S} w(e) + \sum_{e \in S''} w(e)
  &\le \sum_{e \in S} w(e) + \Bigl( \sum_{e \in S''} w(e) - |S'| \Bigr) + |S'| \\
  &=   \sum_{e \in S} w(e) + \sum_{e \in S'} w_c(e) + |S'| \\
  &=   \sum_{e \in S} w_c(e) + \sum_{e \in S'} w_c(e) + |S'| \\
  &\le |S'| - 1.
\end{aligned}
\]

Note that $|S'| \ge 2$, because otherwise the cycle $S \cup S''$ would have no registers, which
contradicts the fact that $G$ is synchronous. Now, for the delay-to-register ratio of the cycle
$S \cup S''$ in $G$ we have:

\[
\begin{aligned}
\frac{\sum_{v \in S} d(v) + \sum_{v \in S''} d(v)}{\sum_{e \in S} w(e) + \sum_{e \in S''} w(e)}
  &\ge \frac{\sum_{v \in S} d(v) + \sum_{v \in S''} d(v)}{|S'| - 1} \\
  &\ge \frac{|S'| \, (c + 1 - d_{\max})}{|S'| - 1} \\
  &=   \frac{|S'|}{|S'| - 1} \, \lceil R(C^*(G)) \rceil \\
  &\ge \frac{|S'|}{|S'| - 1} \, R(C^*(G)).
\end{aligned}
\]

Since $|S'|/(|S'| - 1) > 1$, we conclude that there exists a cycle in $G$ with delay-to-register
ratio greater than the maximum delay-to-register ratio $R(C^*(G))$, which is a contradiction.
Therefore, $G_c$ has no negative edge-weight cycles and, according to Lemma 6, there exists
a retiming $r$ of $G$ such that $\Phi(G_r) \le \lceil R(C^*(G)) \rceil + d_{\max} - 1$. Consequently
$\Phi_{\min}(G) \le \lceil R(C^*(G)) \rceil + d_{\max} - 1$. □

UD-Retiming(G)

1 for each edge $e \in E$
      do $w'(e) \gets \min\{w(e), |V|\}$
2 $\Phi_{\min}(G') \gets \left\lceil \max_{C \in G'} |C| \,/ \sum_{e \in C} w'(e) \right\rceil$      ▷ $\Phi_{\min}(G') = \lceil R(C^*(G')) \rceil$
3 for each vertex $v \in V$
      do $r(v) \gets \lceil$ length of single-source shortest path $s \leadsto v$ in $G' - 1/\Phi_{\min}(G')$ $\rceil$
4 for each edge $u \xrightarrow{e} v \in E$
      do $w_r(e) \gets w(e) + r(u) - r(v)$

Figure 1-4: Algorithm UD-Retiming for optimal retiming of unit-delay circuitry. Given a
unit-delay circuit $G = (V,E,1,w)$, this algorithm determines a retimed circuit $G_r$ with minimum
clock period.



1.4 Optimal Retiming of Unit-Delay Circuitry 

In this section we describe an efficient algorithm for optimal retiming of edge-triggered
circuitry in the special case where all combinational logic blocks have unit propagation
delays. Specifically, we give an $O(V^{1/2} E \lg V)$-time procedure for the following problem:
Given a unit-delay edge-triggered circuit $G = (V, E, 1, w)$, determine a retiming $r$ such that

\[ \Phi(G_r) = \Phi_{\min}(G). \]

Our Algorithm UD-Retiming for optimal retiming of unit-delay circuitry is illustrated
in Figure 1-4. The basic idea behind it is the construction of a circuit $G' = (V,E,1,w')$
with small edge-weights $w'(e)$, such that $r$ is a retiming of $G'$ with clock period $\Phi(G'_r)$ if
and only if $r$ is a retiming of $G$ with clock period $\Phi(G_r) = \Phi(G'_r)$. Optimal retimings of $G$
can be computed more efficiently on $G'$ using procedures that scale the small edge-weights
$w'(e)$.

In order to prove the correctness of Algorithm UD-Retiming, we must show that the
optimal clock period $\Phi_{\min}(G)$ equals the optimal clock period $\Phi_{\min}(G')$ computed in Step 2
of the algorithm. This equality is a special case of the following general theorem.

Theorem 8 Let $G = (V,E,d,w)$ be an edge-triggered circuit, and let $G' = (V,E,d,w')$ be
the edge-triggered circuit with $w'(e) = \min\{w(e), |V| d_{\max}\}$ for every edge
$u \xrightarrow{e} v \in E$. Let $R(C^*(G))$ and $R(C^*(G'))$ be the maximum delay-to-register ratios
of $G$ and $G'$ respectively. Then

\[ \lceil R(C^*(G)) \rceil = \lceil R(C^*(G')) \rceil. \]

Proof. In order to prove the theorem, we first argue that the maximum of $R(C)$ over the cycles
$C$ of $G$ is attained by a simple cycle. Indeed, if for any non-simple cycle $C = C_1 \cup C_2$ with
$R(C_1) \ge R(C_2)$ we have $R(C) > R(C_1)$, then a straightforward calculation shows that
$R(C_1) < R(C_2)$, which contradicts our assumption about $R(C_1)$ and $R(C_2)$.

Now, we show that every simple cycle $C$ in $G$ must satisfy

\[ \left\lceil \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w(e)} \right\rceil
   = \left\lceil \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w'(e)} \right\rceil. \tag{1.4} \]

If $w(e) \le |V| d_{\max}$ for every $e \in C$, then $w(e) = w'(e)$ for every $e \in C$ and equation (1.4)
holds. If $w(e) > |V| d_{\max}$ for some edge $e \in C$, then $w'(e) = |V| d_{\max}$. Since $|C| \le |V|$, we
have

\[ \left\lceil \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w(e)} \right\rceil
   = \left\lceil \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w'(e)} \right\rceil = 1. \]

Therefore, equation (1.4) holds again. □

For unit-delay circuitry, it follows that $\Phi_{\min}(G)$ equals $\Phi_{\min}(G')$.



Corollary 9 Let $G = (V,E,1,w)$ be a unit-delay circuit, and let $G' = (V,E,1,w')$ be the
unit-delay circuit with $w'(e) = \min\{w(e), |V|\}$ for every edge $u \xrightarrow{e} v \in E$. Let
$\Phi_{\min}(G)$ and $\Phi_{\min}(G')$ be the minimum clock periods that can be achieved by retiming $G$
and $G'$ respectively. Then

\[ \Phi_{\min}(G) = \Phi_{\min}(G'). \]

Proof. Follows directly from Corollary 4 and Theorem 8 for $d_{\max} = 1$. □



Step 3 of Algorithm UD-Retiming computes an optimal retiming of $G'$, since a retiming
that achieves a given clock period $c$ can be computed for any unit-delay circuit $G$ by
rounding up the shortest-paths lengths in the graph $G - 1/c = (V,E,w - 1/c)$ with
edge-weight $w(e) - 1/c$ for each edge $e \in E$ [31]. Now, since $w'(e) \le w(e)$ for every edge
$e \in E$, a retiming of $G'$ with clock period $\Phi(G'_r)$ is also a retiming of $G$ with clock period
$\Phi(G_r) = \Phi(G'_r)$. Therefore, Step 4 correctly computes an optimally retimed $G_r$.

Opt-Retiming(G)

1 for each edge $u \xrightarrow{e} v \in E$
      do $w'(e) \gets \min\{w(e), |V| d_{\max}\}$
2 Binary search $[1, |V| d_{\max}]$ for smallest integer $n$      ▷ $n = \lceil R(C^*(G')) \rceil$
  such that $G' - d/n$ has no negative edge-weight cycles.
3 Binary search $[n, n + d_{\max} - 1]$ for smallest integer period
  that can be achieved by a retiming $r$ of $G$.
4 for each edge $u \xrightarrow{e} v \in E$
      do $w_r(e) \gets w(e) + r(u) - r(v)$

Figure 1-5: Algorithm Opt-Retiming for optimal retiming of edge-triggered circuitry. Given an
edge-triggered circuit $G = (V, E, d, w)$, this algorithm determines a retimed circuit $G_r$ with
minimum clock period.

Algorithm UD-Retiming terminates in $O(V^{1/2} E \lg V)$ time. Steps 1 and 4 can be
computed in $O(E)$ time. In Step 2, since every functional element in $G'$ has unit delay, we
have $R(C^*(G')) = 1/\mathrm{mcm}(G')$, where

\[ \mathrm{mcm}(G') = \min_{C \in G'} \frac{\sum_{e \in C} w'(e)}{|C|} \]

is known as the minimum cycle-mean of $G'$. Ahuja and Orlin [39] have presented an
algorithm for computing the minimum cycle-mean of a graph in $O(V^{1/2} E \lg(VW))$ steps,
where $W$ is the maximum edge-weight in the graph.$^1$ Since $w'(e) \le |V|$ for every edge
$e \in E$, we can use this algorithm in Step 2 to compute $\Phi_{\min}(G')$ in $O(V^{1/2} E \lg V)$ time.
The shortest-paths lengths in Step 3 can also be computed in $O(V^{1/2} E \lg V)$ time, using an
$O(V^{1/2} E \lg(VW))$-time algorithm for shortest paths by Gabow and Tarjan [12]. Thus, the
total running time of Algorithm UD-Retiming is
$O(E) + O(V^{1/2} E \lg V) + O(V^{1/2} E \lg V) + O(E) = O(V^{1/2} E \lg V)$.
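
The four steps of Algorithm UD-Retiming can be sketched in Python as follows. This is an illustration under stated assumptions rather than the algorithm with the running time claimed above: Karp's minimum cycle-mean algorithm stands in for the procedure of [39], a plain Bellman-Ford pass with a virtual zero-distance source stands in for the shortest-paths algorithm of [12], and the circuit is assumed to contain at least one cycle. All function names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def min_cycle_mean(vertices, edges):
    """Karp's algorithm: least (total weight) / (edge count) over all cycles."""
    n, inf = len(vertices), float("inf")
    # F[k][v] = least weight of a walk with exactly k edges that ends at v.
    F = [{v: inf for v in vertices} for _ in range(n + 1)]
    F[0] = {v: 0 for v in vertices}
    for k in range(1, n + 1):
        for (u, v), w in edges.items():
            F[k][v] = min(F[k][v], F[k - 1][u] + w)
    best = None
    for v in vertices:
        if F[n][v] == inf:
            continue
        worst = max(Fraction(F[n][v] - F[k][v], n - k)
                    for k in range(n) if F[k][v] != inf)
        best = worst if best is None else min(best, worst)
    return best


def ud_retiming(vertices, edges):
    """Return (r, retimed edge weights, period) for a unit-delay circuit."""
    n = len(vertices)
    capped = {e: min(w, n) for e, w in edges.items()}            # Step 1
    period = ceil(1 / min_cycle_mean(vertices, capped))          # Step 2
    dist = {v: Fraction(0) for v in vertices}                    # Step 3
    for _ in range(n - 1):
        for (u, v), w in capped.items():
            dist[v] = min(dist[v], dist[u] + w - Fraction(1, period))
    r = {v: ceil(dist[v]) for v in vertices}
    retimed = {(u, v): w + r[u] - r[v] for (u, v), w in edges.items()}  # Step 4
    return r, retimed, period


def clock_period_unit_delay(vertices, weights):
    """Number of vertices on the longest register-free path."""
    succ = {v: [] for v in vertices}
    for (u, v), w in weights.items():
        if w == 0:
            succ[u].append(v)
    memo = {}

    def longest(v):
        if v not in memo:
            memo[v] = 1 + max((longest(x) for x in succ[v]), default=0)
        return memo[v]

    return max(longest(v) for v in vertices)


ring = {(0, 1): 2, (1, 2): 0, (2, 3): 0, (3, 0): 0}
r, retimed, period = ud_retiming([0, 1, 2, 3], ring)
print(period, retimed, clock_period_unit_delay([0, 1, 2, 3], retimed))
```

On this four-vertex ring with two registers the original clock period is 4; the sketch spreads the registers so that every register-free path contains two vertices, matching the ratio 4/2.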



1.5 Retiming of Circuitry that Includes Non-Unit Delays 

1.5.1 Optimal Retiming 

In this section we describe an $O(VE \lg d_{\max})$-time algorithm for the optimal retiming
problem in its general form: Given an edge-triggered circuit $G = (V,E,d,w)$, determine a
retiming $r$ such that $\Phi(G_r) = \Phi_{\min}(G)$. An $O(VE \lg V)$-time algorithm that binary searches
the $O(V^2)$-size set of all possible clock periods for the minimum feasible one appears in [31].
Our procedure binary searches an interval with only $d_{\max}$ possible clock periods, and it is
more efficient than that in [31], assuming that $d_{\max}$ grows subpolynomially with respect to
the number of functional elements in the circuit.



$^1$ This $W$ should not be confused with $W(u,v)$ in Theorem 1.

Approx-Retiming(G)

1 for each edge $u \xrightarrow{e} v \in E$
      do $w'(e) \gets \min\{w(e), |V| d_{\max}\}$
2 Binary search $[1, |V| d_{\max}]$ for smallest integer $n$      ▷ $n = \lceil R(C^*(G')) \rceil$
  such that $G' - d/n$ has no negative edge-weight cycles.
3 for each vertex $v \in V$
      do $r(v) \gets \lceil$ length of single-source shortest path $s \leadsto v$ in $G' - d/n$ $\rceil$
4 for each edge $u \xrightarrow{e} v \in E$
      do $w_r(e) \gets w(e) + r(u) - r(v)$

Figure 1-6: Algorithm Approx-Retiming for retiming with approximately minimum clock
period. Given an edge-triggered circuit $G = (V,E,d,w)$, this algorithm determines a retimed circuit
$G_r$ with clock period $\Phi(G_r) \le \Phi_{\min}(G) + d_{\max} - 1$.

Our Algorithm Opt-Retiming is illustrated in Figure 1-5. In Steps 1 and 2, the
algorithm computes the ceiling $\lceil R(C^*(G')) \rceil$ of the maximum delay-to-register ratio of
the circuit $G' = (V,E,d,w')$ with $w'(e) \le |V| d_{\max}$. This value equals the smallest integer $n$
in the interval $[1, |V| d_{\max}]$ of possible ratios that does not induce negative edge-weight cycles
in the graph $G' - d/n = (V,E,w' - d/n)$ with edge-weight $w'(e) - d(v)/n$ for each edge
$u \xrightarrow{e} v \in E$ [25]. Step 3 of the algorithm binary searches the integers in the interval
$[\lceil R(C^*(G')) \rceil, \lceil R(C^*(G')) \rceil + d_{\max} - 1]$ for the minimum achievable clock
period $\Phi_{\min}(G)$. Theorems 3 and 8 and the integrality of the propagation delays guarantee
that $\Phi_{\min}(G)$ is an integer in this interval. The retiming $r$ that corresponds to $\Phi_{\min}(G)$ is
used in Step 4 to compute an optimally retimed $G_r$.
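
Both binary searches in Algorithm Opt-Retiming follow the same pattern, which can be sketched as below. The predicate `is_feasible` is a placeholder for whichever test is being used (negative-cycle detection in Step 2, the feasibility test of [31] in Step 3); the function name and the assumption that feasibility is monotone in the period are made for this illustration.

```python
def smallest_feasible_period(lo, hi, is_feasible):
    """Binary search the integers in [lo, hi] for the least feasible value.

    Assumes that is_feasible is monotone (once True, it stays True for
    larger arguments) and that is_feasible(hi) holds.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid          # mid works, so the answer is at most mid
        else:
            lo = mid + 1      # mid fails, so the answer is larger
    return lo


# Example: n = 4 and d_max = 6, so Step 3 searches the interval [4, 9].
probes = []
answer = smallest_feasible_period(4, 9, lambda c: probes.append(c) or c >= 7)
print(answer, probes)
```

The interval in the example contains six candidate periods and the search probes only three of them, in line with the logarithmic factor in the running time.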

Algorithm Opt-Retiming runs in $O(VE \lg d_{\max})$ time. Step 1 completes in $O(E)$ time.
Negative-weight cycles in Step 2 can be detected by solving a single-source shortest-paths
problem on the edge-weighted graph $G' - d/n$ [25]. Gabow and Tarjan have given an
$O(V^{1/2} E \lg(VW))$-time algorithm for the single-source shortest-paths problem, where $W$
is the maximum edge-weight in the graph [12]. Thus, the binary search in Step 2, which
performs $O(\lg(V d_{\max}))$ such tests, can be completed in $O(V^{1/2} E \lg^2 (V d_{\max}))$ time, since
$w'(e) \le |V| d_{\max}$ for every edge $e \in E$. Step 3 utilizes the $O(VE)$ retiming algorithm by
Leiserson and Saxe [31] to test whether a potential clock period is feasible. Thus, a retiming
that achieves $\Phi_{\min}(G)$ is computed in $O(VE \lg d_{\max})$ time, and the optimally retimed circuit
is computed in Step 4 in $O(E)$ time. The overall running time is
$O(E) + O(V^{1/2} E \lg^2 (V d_{\max})) + O(VE \lg d_{\max}) + O(E) = O(VE \lg d_{\max})$.

1.5.2 Approximately Optimal Retiming 

In this section we give an $O(V^{1/2} E \lg^2 (V d_{\max}))$-time algorithm for retiming a circuit so
that its clock period is approximately minimized. Specifically, we consider the following
problem: Given an edge-triggered circuit $G = (V, E, d, w)$, determine a retiming $r$ such that

\[ \Phi(G_r) \le \Phi_{\min}(G) + d_{\max} - 1. \]

Our Algorithm Approx-Retiming for retiming with approximately minimum clock
period is illustrated in Figure 1-6. The first two steps of the algorithm are the same as
those in Algorithm Opt-Retiming. In Step 3, a retiming $r$ is obtained simply by rounding
up the shortest-paths lengths in the graph $G' - d/n = (V, E, w' - d/n)$ with edge-weights
$w'(e) - d(v)/n$ for each edge $u \xrightarrow{e} v \in E$, where $n = \lceil R(C^*(G')) \rceil$. The following
theorem shows that the period $\Phi(G_r)$ of the retimed circuit $G_r$ does not exceed $\Phi_{\min}(G)$ by
more than $d_{\max} - 1$.

Theorem 10 Let $G = (V,E,d,w)$ be a circuit graph with maximum delay-to-register ratio
$R(C^*(G))$ and let $\Phi_{\min}(G)$ be the minimum clock period we can obtain by retiming $G$. Let
$G' = (V,E,d,w')$ be the circuit with $w'(e) = \min\{w(e), |V| d_{\max}\}$ for every edge
$u \xrightarrow{e} v \in E$. Moreover, let $n = \lceil R(C^*(G')) \rceil$, and let $l$ be the solution of a
single-source shortest-paths problem on the graph $G' - d/n = (V,E,w' - d/n)$ with edge-weight
$w'(e) - d(v)/n$ for each edge $u \xrightarrow{e} v \in E$. Then, the assignment
$r(v) = \lceil l(v) \rceil$ for each vertex $v \in V$ is a retiming of $G$ such that

\[ \Phi(G_r) \le \Phi_{\min}(G) + d_{\max} - 1. \]

Proof. Before we proceed with the proof, we note that $G' - d/n$ has no negative edge-weight
cycles, since $n = \lceil R(C^*(G')) \rceil$. Therefore, the shortest-paths lengths $l(v)$ in $G' - d/n$ are
well-defined. Now, in order to prove that $r(v) = \lceil l(v) \rceil$ is a legal retiming of $G$ with clock
period $\Phi(G_r) \le \Phi_{\min}(G) + d_{\max} - 1$, we must show that $w_r(e) = w(e) + r(u) - r(v) \ge 0$ for
every edge $u \xrightarrow{e} v \in G_r$, and that every path $p \in G_r$ with delay
$d(p) > \Phi_{\min}(G) + d_{\max} - 1$ has at least one register.

First, we prove that the assignment $r(v) = \lceil l(v) \rceil$ for each vertex $v \in V$ satisfies
$w_r(e) \ge 0$ for every edge $e$ in the retimed circuit $G_r$. Since $l$ is a single-source
shortest-paths solution on $(V, E, w' - d/n)$, we have $l(v) \le l(u) + w'(e) - d(v)/n$ for every edge
$u \xrightarrow{e} v$ in $E$. Therefore

\[
\begin{aligned}
\lceil l(v) \rceil - \lceil l(u) \rceil
  &\le \lceil l(v) - l(u) \rceil \\
  &\le \lceil w'(e) - d(v)/n \rceil \\
  &\le w'(e) \\
  &\le w(e),
\end{aligned}
\]

since $\lceil x \rceil - \lceil y \rceil \le \lceil x - y \rceil$ for every real $x, y$, and since $w'(e)$ is an
integer. It follows that $w_r(e) = w(e) + \lceil l(u) \rceil - \lceil l(v) \rceil \ge 0$.

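Now, we show that the assignment $r(v) = \lceil l(v) \rceil$ for each vertex $v \in V$ satisfies the clock
period constraint. Consider any path
$p = u_0 \xrightarrow{e_0} u_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-2}} u_{k-1} \xrightarrow{e_{k-1}} u_k$
in the retimed circuit $G_r$ with delay $\sum_{i=0}^{k} d(u_i) > \Phi_{\min}(G) + d_{\max} - 1$. For this
path we have

\[ \sum_{i=1}^{k} d(u_i) \;>\; \Phi_{\min}(G) + d_{\max} - 1 - d(u_0) \;\ge\; \lceil R(C^*(G')) \rceil - 1 \;=\; n - 1, \]

since $\Phi_{\min}(G) \ge \lceil R(C^*(G')) \rceil$ from Theorems 3 and 8, and since $d_{\max} \ge d(u_0)$ by
definition. Because the propagation delays are integers, it follows that
$\sum_{i=1}^{k} d(u_i) \ge n$. Therefore, the number $w_r(p)$ of registers along the path $p$ in the
retimed circuit satisfies

\[
\begin{aligned}
w_r(p) &= w(p) + \lceil l(u_0) \rceil - \lceil l(u_k) \rceil \\
       &\ge w(p) - \lceil l(u_k) - l(u_0) \rceil \\
       &\ge w(p) - \Bigl\lceil w(p) - \frac{1}{n} \sum_{i=1}^{k} d(u_i) \Bigr\rceil \\
       &\ge w(p) - (w(p) - 1) = 1,
\end{aligned}
\]

where the third line uses $l(u_k) \le l(u_0) + w'(p) - \frac{1}{n}\sum_{i=1}^{k} d(u_i)$ and
$w'(p) \le w(p)$. The last inequality implies that there exists at least one register along $p$,
and therefore the clock period constraint is met. □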

Algorithm Approx-Retiming terminates in $O(V^{1/2} E \lg^2 (V d_{\max}))$ time. Step 1
completes in $O(E)$ time. Steps 2 and 3 complete in $O(V^{1/2} E \lg^2 (V d_{\max}))$ time, since
negative edge-weight cycles can be detected in $O(V^{1/2} E \lg(V d_{\max}))$ time using the
single-source shortest-paths algorithm by Gabow and Tarjan [12], and the binary search in Step 2
performs $O(\lg(V d_{\max}))$ such tests. Step 4 terminates in $O(E)$ time, and
thus the total running time is $O(V^{1/2} E \lg^2 (V d_{\max}))$.



1.6 Optimal Pipelining of Combinational Circuitry 

In this section, we describe an $O(E \lg d_{\max})$ algorithm for the problem of pipelining
combinational circuitry with the minimum possible clock period. Our result improves the
$O(VE \lg V)$ bound that can be obtained by applying the general retiming algorithm from [31],
and it is optimal within a constant multiplicative factor for unit-delay circuitry. In contrast
to the algorithm in [31] that computes all possible clock periods, our algorithm exploits
the special structure of a combinational circuit to identify an interval of $d_{\max} + 1$ integers
that contains the optimal period.

In a combinational circuit the graph is acyclic and has an input interface $v_{in}$ and an
output interface $v_{out}$. Initially, the circuit is assumed to have no registers. By retiming a
combinational circuit $G$ we add registers to the circuit in such a way that the retimed circuit
$G_r$ achieves a shorter clock period at the cost of introducing a latency of
$r(v_{in}) - r(v_{out})$ clock ticks for the signals to propagate from the input interface $v_{in}$ to
the output interface $v_{out}$. The problem of minimum clock period pipelining is defined as
follows: Given a combinational circuit $G = (V, E, d, 0)$ and a positive integer $l$, determine a
retiming $r$ such that $G_r$ is a pipelined combinational circuit with latency at most $l$ and with
minimum clock period.

MPP(G, l)

1 Compute the delay $\Delta$ of the longest path $p_\Delta$ in $G$ from $v_{in}$ to $v_{out}$.
2 Use Algorithm MLP to binary search
  $[\lceil \Delta/(l+1) \rceil, \lceil \Delta/(l+1) \rceil + d_{\max}]$ for the minimum integer
  period achievable with latency at most $l$.

Figure 1-7: Algorithm MPP for minimum period pipelining. Given a combinational circuit
$G = (V, E, d, 0)$ with input interface $v_{in}$ and output interface $v_{out}$, and a positive integer $l$,
this algorithm determines a retiming $r$ such that $G_r$ is a pipelined combinational circuit with
latency $l$ and minimum clock period.

Our Algorithm MPP for minimum clock period pipelining of combinational circuitry
is illustrated in Figure 1-7. Step 1 of the algorithm computes the delay $\Delta$ of the longest
path $v_{in} \leadsto v_{out}$ in $O(E)$ steps by traversing the vertices of $G$ in topological sort order [5].
Step 2 binary searches an interval of $d_{\max} + 1$ integers for the minimum achievable clock
period $\Phi_{\min}(G)$. In each iteration of the search, the $O(E)$-time Algorithm MLP generates
a pipelined circuit that achieves the clock period under consideration with the minimum
possible latency. The search ends when the shortest period that can be achieved with
latency at most $l$ has been identified. Thus, Algorithm MPP terminates in $O(E \lg d_{\max})$
steps.
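
Step 1 and the search interval of Step 2 can be sketched as follows. The code assumes that every vertex lies on some path from the input interface to the output interface, so that the longest path in the graph is the longest path between the two interfaces; the function names are hypothetical, and the interval endpoints are the ones used in Figure 1-7.

```python
from collections import deque


def longest_path_delay(vertices, edges, delay):
    """Delay of the longest path in a DAG, by one topological traversal."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    longest = {v: delay[v] for v in vertices}   # best path delay ending at v
    queue = deque(v for v in vertices if indeg[v] == 0)
    while queue:                                # Kahn's topological order
        u = queue.popleft()
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return max(longest.values())


def period_search_interval(total_delay, latency, d_max):
    """Interval of candidate periods searched in Step 2 of Algorithm MPP."""
    lower = -(-total_delay // (latency + 1))    # ceiling division
    return lower, lower + d_max


verts = ["in", "a", "b", "out"]
arcs = [("in", "a"), ("in", "b"), ("a", "out"), ("b", "out")]
delays = {"in": 1, "a": 2, "b": 5, "out": 1}
total = longest_path_delay(verts, arcs, delays)
print(total, period_search_interval(total, latency=2, d_max=5))
```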

The correctness of Algorithm MPP follows directly from Theorem 11, which bounds the
minimum achievable clock period $\Phi_{\min}(G)$, and from the correctness of Algorithm MLP for
minimum latency pipelining. First, we prove the bounds on $\Phi_{\min}(G)$. The proof relies on the
constraint graph $G_c$, which in this case has been augmented by an edge $v_{out} \to v_{in}$ of weight
$l$ in order to account for the desired latency of the circuit.

Theorem 11 Let $G = (V,E,d,0)$ be a combinational circuit with input interface $v_{in}$ and
output interface $v_{out}$. Let $\Delta$ be the delay of the path $p_\Delta = v_{in} \leadsto v_{out}$ in $G$ with the
longest propagation delay, and let $l$ be a positive integer. Then the minimum clock period
$\Phi_{\min}(G)$ for any pipelined version of $G$ with latency $l$ satisfies:

\[ \left\lceil \frac{\Delta}{l+1} \right\rceil \le \Phi_{\min}(G) \le \left\lceil \frac{\Delta}{l+1} \right\rceil + d_{\max}. \]

Proof. According to Theorem 1, every retiming $r$ that yields a pipelined version of the
original circuit with clock period at most $c$ must satisfy

\[ r(v) - r(u) \le 0 \tag{1.5} \]

for every edge $u \to v$ in $E$, and

\[ r(v) - r(u) \le -1 \tag{1.6} \]

for all vertices $u, v \in V$ connected by a path $p \in G$ with delay $d(p) > c$. In addition, it
must satisfy a latency constraint

\[ r(v_{in}) - r(v_{out}) \le l, \tag{1.7} \]

where $l$ is an upper bound on the latency of the pipelined circuit. Inequalities (1.5), (1.6),
and (1.7) induce a constraint graph $G_c$ that is described in the statement of Theorem 2.
Note that Inequality (1.7) introduces an edge $v_{out} \xrightarrow{e_l} v_{in}$ in $G_c$ with weight
$w_c(e_l) = l$, and that $e_l$ is included in every cycle in $G_c$.

First, we derive the lower bound on $\Phi_{\min}(G)$. Let $r$ be a retiming of the circuit with
latency $l$ and clock period $\Phi_{\min}(G)$. Adding up the delays of the $l + 1$ register-free parts
along the path $p_\Delta$ yields $\Delta \le \Phi_{\min}(G)(l + 1)$. It follows that
$\Phi_{\min}(G) \ge \lceil \Delta/(l + 1) \rceil$, since the propagation delay $d(v)$ is an integer for every
vertex $v \in V$.

We establish the upper bound on $\Phi_{\min}(G)$ by proving that $\lceil \Delta/(l + 1) \rceil + d_{\max}$ is a
feasible clock period. According to Lemma 6, it suffices to show that $G_c$ has no negative
edge-weight cycles for $c = \lceil \Delta/(l + 1) \rceil + d_{\max}$. The only negative-weight edges of $G_c$
have weight $-1$. Each such edge stands for a path in $G$ whose delay exceeds $c$, and
consecutive paths share at most one vertex, so the number of such edges in any path from
$v_{in}$ to $v_{out}$ is at most

\[ \frac{\Delta - 1}{(\lceil \Delta/(l+1) \rceil + d_{\max}) - d_{\max}}
   \;=\; \frac{\Delta - 1}{\lceil \Delta/(l+1) \rceil}
   \;\le\; \frac{\Delta - 1}{\Delta/(l+1)}
   \;<\; l + 1, \]

and hence at most $l$, since $l + 1 > 0$ and the number of edges is an integer. Every cycle in
$G_c$ must use $e_l$ with $w_c(e_l) = l$, and therefore the last inequality implies that $G_c$ has no
negative edge-weight cycles. □



Algorithm MPP employs a subroutine for pipelining a combinational circuit to achieve
a specified period $c$ using the minimum possible number of stages in the pipeline. In
mathematical terms, this subroutine computes a solution $r$ for Inequalities (1.5) and (1.6),
such that the latency $r(v_{in}) - r(v_{out})$ is minimized. According to Theorem 2, a solution $r$
is given by the lengths of the single-source shortest paths in the possibly dense constraint
graph $G_c = (V, E_c, w_c)$ that is induced by Inequalities (1.5) and (1.6).

Algorithm MLP, which is described in Figure 1-8, finds a minimum latency pipelining of
a combinational circuit by a single sweep over its circuit graph $G$. The intuitive idea behind
the algorithm is to visit the vertices of $G$ while keeping track of the longest propagation
delay $\delta(v)$ along any combinational path $u \leadsto v$ in $G$ up to the currently visited vertex
$v$. New registers are inserted greedily: whenever $\delta(v)$ exceeds the desired clock period $c$, a
pipeline stage is introduced. Algorithm MLP terminates in $O(E)$ steps, since each edge is
examined twice, and as we prove in the following lemma it returns a correct answer.
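
The greedy sweep just described can be written out as a short Python sketch. It mirrors the steps of Algorithm MLP under a few assumptions made for the example: the vertices are supplied already in topological order, the period `c` is at least the largest vertex delay, and the function name `mlp` and its return values are illustrative.

```python
def mlp(topo_order, edges, delay, c):
    """Greedy minimum-latency pipelining for a target clock period c.

    Returns (r, delta): r(v) is the retiming label of v, and delta(v) is the
    longest register-free delay of a path that ends at v.
    """
    preds = {v: [] for v in topo_order}
    for u, v in edges:
        preds[v].append(u)
    r = {v: 0 for v in topo_order}
    delta = {v: delay[v] for v in topo_order}
    for v in topo_order:
        for u in preds[v]:
            if delta[u] + delay[v] > c:
                r[v] = min(r[v], r[u] - 1)   # a register is needed on u -> v
            else:
                r[v] = min(r[v], r[u])
        for u in preds[v]:
            if r[u] == r[v]:                 # u -> v stays register-free
                delta[v] = max(delta[v], delta[u] + delay[v])
    return r, delta


# A chain of four unit-delay elements pipelined for clock period 2.
chain = ["v0", "v1", "v2", "v3"]
arcs = [("v0", "v1"), ("v1", "v2"), ("v2", "v3")]
r, delta = mlp(chain, arcs, {v: 1 for v in chain}, c=2)
registers = {(u, v): r[u] - r[v] for u, v in arcs}
print(r, registers, r["v0"] - r["v3"])
```

In the example a single register lands on the edge v1 -> v2, which splits the chain into two stages of delay 2 and gives a latency of one clock tick.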

Lemma 12 The assignment $r$ computed by Algorithm MLP is a solution to the single-source
shortest-paths problem on the constraint graph $G_c$ defined by Inequalities (1.5) and (1.6).

MLP(G, c)

1  for each vertex $v \in V$
2      do $r(v) \gets 0$
3         $\delta(v) \gets d(v)$
4  for each vertex $v \in V$ in topological sort order
5      do for each edge $u \to v \in E$
6             do if $\delta(u) + d(v) > c$
7                   then $r(v) \gets \min\{r(v), r(u) - 1\}$
8                   else $r(v) \gets \min\{r(v), r(u)\}$
9         for each edge $u \to v \in E$
10            do if $r(v) = r(u)$
11                  then $\delta(v) \gets \max\{\delta(v), \delta(u) + d(v)\}$

Figure 1-8: Algorithm MLP for minimum latency pipelining. Given a combinational circuit
$G = (V, E, d, 0)$ and a desired clock period $c$, this algorithm determines a pipelined combinational
circuit $G_r$ with clock period $\Phi(G_r) \le c$ and minimum latency.

Proof. We prove that $r(v)$ gives the length of the shortest path $v_{in} \leadsto v$ in the constraint
graph $G_c = (V, E_c, w_c)$, and that $\delta(v)$ gives the maximum propagation delay among all
paths $u \leadsto v$ in $G$ with $r(u) = r(v)$. The proof is by induction on the number of vertices
that have been visited by the algorithm.

For the basis, the input interface $v_{in}$ is visited, and we have $r(v_{in}) = 0$ and
$\delta(v_{in}) = d(v_{in})$ after initialization. Therefore, the lemma holds.

For the inductive step, vertex $v$ is visited after all preceding vertices $u$ have been visited.
We assume that

\[ r(u) = \min\{w_c(p) : v_{in} \stackrel{p}{\leadsto} u \in G_c\}, \]

and

\[ \delta(u) = \max\{d(p) : u' \in V,\ u' \stackrel{p}{\leadsto} u \in G,\ r(u') = r(u)\} \]

for every vertex $u$ preceding $v$ in the topological order. Lines 5-8 perform a relaxation over
all edges $u \to v$ in $E$ and set

\[
\begin{aligned}
r(v) &= \min\bigl\{ \{r(u) - 1 : \delta(u) + d(v) > c,\ u \to v \in E\} \cup \{r(u) : \delta(u) + d(v) \le c,\ u \to v \in E\} \bigr\} \\
     &= \min\{w_c(p) : v_{in} \stackrel{p}{\leadsto} v \in G_c\},
\end{aligned}
\]

since $\delta(u) + d(v) > c$ implies that there exists an edge $u' \xrightarrow{e} v \in E_c$ such that
$r(u') = r(u)$ and $w_c(e) = -1$. Lines 9-11 set

\[
\begin{aligned}
\delta(v) &= d(v) + \max\{\delta(u) : u \to v \in E,\ r(u) = r(v)\} \\
          &= \max\{d(p) : u' \in V,\ u' \stackrel{p}{\leadsto} v \in G,\ r(u') = r(v)\},
\end{aligned}
\]

and therefore the lemma holds. □

1.7 Conclusion

In this chapter we presented tight bounds on the minimum clock period that can be achieved
by retiming an edge-triggered circuit. We expressed these bounds in terms of the maximum
delay-to-register ratio around the cycles in the circuit and the maximum propagation delay
$d_{\max}$ of the combinational logic blocks in the circuit. Using these bounds, we characterized
exactly the minimum clock period that can be achieved by retiming unit-delay circuits, and
we designed improved algorithms for several problems related to retiming. Specifically, we
presented an $O(V^{1/2} E \lg V)$-time algorithm for optimal retiming of unit-delay circuits. This
is the asymptotically fastest algorithm known to date for this problem. For circuits that
also include combinational logic blocks with non-unit delays, we gave an $O(VE \lg d_{\max})$-time
algorithm for optimal retiming and an $O(V^{1/2} E \lg^2 (V d_{\max}))$-time algorithm for retiming
with a clock period that does not exceed the optimal by more than $d_{\max} - 1$. Finally, we
gave an $O(E \lg d_{\max})$-time algorithm for optimal pipelining of combinational circuitry.

An interesting open problem is the design of an algorithm for optimal retiming of circuits 
that include non-unit delay combinational logic blocks whose running time matches that of 
Algorithm UD-Retiming for optimal retiming of unit-delay circuits. 



Chapter 2 

Analyzing and Optimizing 
Level-Clocked Circuitry 



2.1 Introduction 

A VLSI designer often has the choice of whether to use level-clocked latches or the more 
conventional edge-triggered latches to implement clocked storage elements in his circuit. An 
edge-triggered latch directly supports the abstraction of a storage element that is synchro- 
nized by the ticking of a clock. When the clocking waveform rises (i.e., the clock ticks), an 
edge-triggered latch instantaneously samples its input and asserts that value on its output. 

A level-clocked latch operates somewhat differently. While the clock input to a level- 
clocked latch is low, the latch output maintains its value from the most recent time that 
the clock was high. While the clock is high, however, the input flows unimpeded to the 
output, unsynchronized with either edge of the clock. In order to avoid problems with race 
conditions, it is common for level-clocked circuits to adopt clocking disciplines which involve 
multiple clock waveforms, or "phases". 

In a two-phase clocking scheme [35], two clocking waveforms, or phases, denoted $\phi_0$ and
$\phi_1$, are employed, as is shown in Figure 2-1. Formally, we denote a two-phase clocking
scheme by a 4-tuple $\pi = (\phi_0, \gamma_0, \phi_1, \gamma_1)$ of strictly positive real numbers. In this context, $\phi_0$
denotes the duty cycle of the first phase, i.e., the length of time during which the phase is
high, while $\gamma_0$ denotes the gap of the first phase, i.e., the amount of time between a falling
edge of the first phase and the next rising edge of the second phase, which generally must
be long enough to overcome various engineering constraints, such as setup and hold times,
the nonzero durations required for clock signals to rise and fall, and clock skew [14, 58]. The
duty cycle and gap of the second phase are, similarly, denoted by $\phi_1$ and $\gamma_1$, respectively.
The ratio $\rho = \phi_1/\phi_0$ is the duty ratio of the clocking scheme. We overload the symbol $\pi$ to
denote the sum

$$\pi = \phi_0 + \gamma_0 + \phi_1 + \gamma_1, \tag{2.1}$$

which is the period of the clocking scheme $\pi = (\phi_0, \gamma_0, \phi_1, \gamma_1)$. We also overload $\phi_i$ to
denote both phase $i$ and its corresponding duty cycle, and $\gamma_i$ to denote both gap $i$ and its
corresponding duration. Observe that since the values $\gamma_0$ and $\gamma_1$ are strictly greater than
0, phases $\phi_0$ and $\phi_1$ are nonoverlapping (not simultaneously high).

Parts of this chapter represent joint research with Alex Ishii and Charles Leiserson.
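
As an illustration only, the 4-tuple and its overloaded period can be sketched in Python. The class name `ClockingScheme` and its field names are hypothetical; they are not taken from Tim or from any other implementation described in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockingScheme:
    """Two-phase scheme (phi0, gamma0, phi1, gamma1); all names are illustrative."""
    phi0: float    # duty cycle of phase 0
    gamma0: float  # gap between the fall of phase 0 and the rise of phase 1
    phi1: float    # duty cycle of phase 1
    gamma1: float  # gap between the fall of phase 1 and the rise of phase 0

    def __post_init__(self):
        # The text requires all four components to be strictly positive.
        if min(self.phi0, self.gamma0, self.phi1, self.gamma1) <= 0:
            raise ValueError("all components must be strictly positive")

    @property
    def period(self):
        # Equation (2.1): the sum of both duty cycles and both gaps.
        return self.phi0 + self.gamma0 + self.phi1 + self.gamma1

    @property
    def duty_ratio(self):
        # The ratio phi1 / phi0 defined in the text.
        return self.phi1 / self.phi0


scheme = ClockingScheme(phi0=4, gamma0=1, phi1=3, gamma1=2)
print(scheme.period, scheme.duty_ratio)
```

Rejecting nonpositive gaps in the constructor mirrors the assumption that makes the two phases nonoverlapping.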

In this chapter we give an efficient algorithm to verify the proper timing of circuits that 
employ two-phase clocking schemes, and we present several other algorithms for optimizing 
their clock periods. Since an edge-triggered latch can be implemented by two back-to-back
level-clocked latches [14], our algorithms also provide an automatic way to take
edge-triggered circuits and transform them into faster level-clocked ones. At the end of this
chapter, we generalize most of our algorithms to level-clocked circuits that employ more 
than two clock phases. 

We model a circuit as an edge-weighted, vertex-weighted multigraph G = (V, E) in 
which V is a collection of combinational logic elements with associated propagation delays 
and E is the set of interconnections, each of which passes through zero or more latches. 
Each latch is clocked by either $\phi_0$ or $\phi_1$. A general framework for the timing verification of
level-clocked circuits appears in [19, 20].

An example of a two-phase, level-clocked circuit is shown in Figure 2-2(a). The integers
in the vertices signify propagation delays. For simplicity, let us assume that in the figure,
we have $\gamma_0 = \gamma_1 = 0$. (In our mathematical development, we shall assume that the gaps
$\gamma_0$ and $\gamma_1$ are strictly positive, as is consistent with engineering situations, and because the
assumption of zero gaps can raise some subtle, but tedious and largely irrelevant, difficulties.)
If the two phases $\phi_0$ and $\phi_1$ have equal duty cycles, then the circuit cannot be clocked with
a clock period shorter than 36 units of time, since the path DEA has propagation delay
54, and intuitively, a datum has at most 3/2 clock periods to propagate from the latch
preceding D to the latch succeeding A.

The first problem that we consider in this chapter is the timing verification problem for
two-phase circuits. We give an $O(VE)$-time algorithm that verifies whether a level-clocked
circuit is properly timed by a given two-phase clocking scheme. This result improves the
$O(E^2)$ bound obtained when the general algorithm from [20] is applied to the special case
of two-phase, level-clocked circuits. (The bound in [20] is also $O(VE)$, but the circuit
model used in that paper represented both functional elements and latches as vertices,
and interconnections between them as edges. Translating to the model presented here
yields the $O(E^2)$ bound. The algorithm in [20] applies to more general circuits and timing
methodologies than the ones considered here, however.)

Figure 2-1: A two-phase, nonoverlapping clocking scheme $\pi = (\phi_0, \gamma_0, \phi_1, \gamma_1)$.

Figure 2-2: An illustration of the various techniques for optimizing two-phase, level-clocked
circuits. Part (a) of the figure shows a simple level-clocked circuit. When the duty cycles of the phases
$\phi_0$ and $\phi_1$ are equal, as is shown in part (e) of the figure, the clock period of circuit (a) cannot be
made smaller than 36. By tuning the clocking scheme of circuit (a) to have a duty ratio of 4:7, as
shown in part (f), a clock period of 33 can be achieved. Part (b) of the figure shows a retimed
version of circuit (a). When clocked with the symmetric clocking scheme shown in part (g), circuit (b)
achieves a clock period of 31, which is optimal for symmetric clocking schemes under any retiming.
The optimal combination of retiming and tuning for circuit (a) is shown in part (c). When clocked
with the waveforms in part (h), which have a duty ratio of 3:2, circuit (c) achieves a clock period
of 30, which is the optimal for any combination of clocking scheme and retiming. Part (d) shows a
typical level-clocked implementation of an edge-triggered circuit which is equivalent under retiming
to the other circuits. Each pair of level-clocked latches in the circuit implements an edge-triggered
latch. This edge-triggered circuit has a clock period of 36, which is the best that can be obtained
by any retiming that is edge triggered.

In addition to timing verification, we consider a collection of timing analysis problems for 
two-phase circuits. The algorithms we provide identify the critical paths of the circuit and 
provide information on the sensitivity of the circuit's timing to changes in the propagation 
delays of its combinational logic blocks. For the clocking scheme in Figure 2-2(e), for 
example, the path DEA in the circuit of Figure 2-2(a) is critical. The propagation delay of 
B can increase by 12, however, without violating the circuit's proper timing by the clocking 
scheme in Figure 2-2(e). We give an $O(VE)$-time algorithm for the noncritical sensitivity
analysis problem in which we wish to compute by how much we can increase the propagation
delay of any given block of combinational logic without affecting the proper timing of the
circuit by a given clocking scheme. We also give an $O(VE + V^2 \lg V)$-time algorithm that
solves this problem for all combinational logic blocks in the circuit. Another problem we
investigate is the critical sensitivity analysis problem for combinational logic blocks that
lie on critical paths. We present an $O(VE)$-time algorithm that computes the minimum
decrease in the propagation delay of any critical block that is required to remove it from
the critical path.

Our next result deals with modifying, or "tuning", the clocking scheme of a circuit — 
that is, providing the circuit with new clocking waveforms. The two clocking schemes in 
Figures 2-2(e) and 2-2(f), for example, illustrate tuning. Observe that the circuit shown
in Figure 2-2(a) is properly timed by either clocking scheme, but the first clocking scheme 
has a period of 36, while the second has a clock period of 33. The notion of clock tuning 
encompasses more than a simple increase in the frequency of the clock. In particular, 
clock tuning allows also for the adjustment of the duty ratio of the clocking scheme. In 
the example of Figure 2-2(a), the circuit cannot be properly timed by any clocking scheme 
whose period is less than 33, since the delay of the path CDEA is 66 and must be distributed 
over at most two full clock periods. The clocking scheme shown in Figure 2-2(f) is thus an 
optimal tuning for the circuit in Figure 2-2(a). 
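
The two lower bounds quoted for the circuit of Figure 2-2(a) both come from dividing a path delay by the number of clock periods available to that path. The worked form below uses only the numbers given in the text, writes $\pi$ for the period, and keeps the zero-gap simplification assumed for the figure:

```latex
\begin{align*}
\text{equal duty cycles, path } DEA:&\quad \tfrac{3}{2}\,\pi \ge 54 \;\Longrightarrow\; \pi \ge 36,\\
\text{any duty ratio, path } CDEA:&\quad 2\,\pi \ge 66 \;\Longrightarrow\; \pi \ge 33.
\end{align*}
```

The second bound does not depend on the duty ratio, which is why tuning alone cannot bring the period of this circuit below 33.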

The tuning problem for two-phase circuits is the problem of adjusting the phases of 
a clocking scheme so as to clock a given two-phase circuit as quickly as possible. We 
assume that the gaps $\gamma_0$ and $\gamma_1$ must be kept fixed and only the duty cycles of the phases
can be adjusted. We give an $O(VE)$-time algorithm to solve the tuning problem. Previous
algorithms for tuning have either addressed other types of clocking methodologies [9, 49, 57], 
or been uncharacterized with respect to worst-case running time [6, 51]. 

Another way to optimize a circuit is by retiming: a method for relocating latches within 
the circuit without affecting the functionality of the circuit. Retiming has been well studied 
in the context of edge-triggered circuits [28, 29, 31, 33, 45] and has been the subject of 
study in the context of single-phase, level-clocked circuits [52]. We extend the retiming 
technique to encompass the optimization of two-phase, level-clocked circuits. We consider 
three problems related to retiming. 

The retiming problem for two-phase circuits asks whether, for a given two-phase circuit
$G$ and clocking scheme $\pi$, the circuit $G$ can be retimed to be properly timed by $\pi$. As an
example, consider the circuit in Figure 2-2(a). If we retime the circuit to be properly timed
by the clocking scheme in Figure 2-2(g), we obtain the circuit in Figure 2-2(b). We provide
an algorithm to solve the retiming problem that runs in $O(V^3)$ time.

The retiming problem with symmetric clocking schemes is the common special case of
the retiming problem in which the two phases of the clocking scheme are identical and 180
degrees out of phase. Such a symmetric clocking scheme, which in general has the form
$\pi = (\phi, \gamma, \phi, \gamma)$, is shown in Figure 2-2(g). We provide an algorithm that solves the retiming
problem for symmetric clocking schemes in $O(VE + V^2 \lg V)$ time.

The retiming problem for minimum latch count asks for a retiming of a given circuit $G$
that achieves a given symmetric clocking scheme $\pi$ with the minimum number of latches.
We describe an algorithm that solves this problem in $O(V^3 \lg V)$ time.

We consider three problems related to both retiming and tuning. The first is a general 
optimization problem, and the other two are progressively more specialized. 

The retiming and tuning problem asks how a circuit can be retimed and its clock simul- 
taneously tuned to achieve the minimum clock period over all possible clocking schemes. 
For example, the optimal retiming of the circuit in Figure 2-2(a) yields the circuit in Fig- 
ure 2-2(c) with the clocking scheme in Figure 2-2(h). We provide an efficient approximation 
algorithm for this general tuning and retiming problem. For any given relative error $\epsilon > 0$,
the approximation algorithm runs in $O(V^3 (1/\epsilon) \lg(1/\epsilon) + (VE + V^2 \lg V) \lg(V/\epsilon))$ time and
produces a retimed circuit that is properly timed by a clocking scheme whose period is at
most $(1 + \epsilon)$ times the optimal. We also provide an $O(V^{11})$-time algorithm that solves the
general tuning and retiming problem exactly. 

The retiming and fixed-duty-ratio tuning problem is the special case of the retiming and
tuning problem where we ask how a circuit can be retimed and its clock tuned to achieve
the minimum clock period over all clocking schemes that have a given duty ratio. We
give an algorithm that, for any given relative error $\epsilon > 0$, runs in $O(V^3 \lg(V/\epsilon))$ time.
This algorithm can be adapted to solve the retiming and fixed-duty-cycle tuning problem in
which the duty cycle of one of the two phases of the clock is given and the other can be
adjusted. We also give an $O(V^2 E + V^3 \lg V)$-time algorithm that solves the retiming and
fixed-duty-ratio tuning problem exactly.

When we require the two phases of the clocking scheme to be symmetric, we have a 
special case of the retiming and fixed-duty-ratio tuning problem that we call the retiming 
and symmetric tuning problem. For example, the optimal retiming of the circuit in Figure 2- 
2(a) with respect to symmetric clocking schemes yields the circuit in Figure 2-2(b) with the 
clocking scheme in Figure 2-2(g). We give an approximation scheme for the retiming and
symmetric tuning problem that runs in $O((VE + V^2 \lg V) \lg(V/\epsilon))$ time for any relative error
$\epsilon > 0$. We also give an exact algorithm for the retiming and symmetric tuning problem that
runs in $O(V^2 E)$ time.

Our optimization techniques can be used not only to facilitate the design of level-clocked 
circuits, but also to convert edge-triggered circuits into faster level-clocked circuits. The 
basis for the conversion is the fact that an edge-triggered latch can be implemented by 
a pair of level-sensitive latches. In Figure 2-2(d) we illustrate the typical level-clocked 
implementation of an edge-triggered circuit. This circuit has a clock period of 36, but the
algorithms presented in this chapter can automatically produce the optimal level-clocked 
circuit in Figure 2-2(c), which, with the clocking scheme in Figure 2-2(h), is timed properly 
with a clock period of 30. As testimony to the additional power gained by level-clocking, 
observe that no edge-triggered retiming of the circuit from Figure 2-2(d) improves upon its 
period of 36. 

Most of the algorithms described in this chapter have been implemented in Tim, a 
software package for two-phase, level-clocked circuitry that we describe in Chapter 3 [48].
We have already used Tim to compare empirically two-phase, level-clocked circuits and 
edge-triggered circuits in terms of speed and number of storage elements [47]. Tim provides 
interactive feedback to designers. For example, rather than simply reporting the minimum 
clock period of a circuit, it performs a "sensitivity analysis" that reports the extent to which 
noncritical propagation delays can be increased without affecting the clock period. 

The remainder of this chapter is organized as follows. 

Section 2.2 describes necessary and sufficient conditions for a two-phase, level-clocked
circuit to be properly timed. Section 2.3 then simplifies these conditions and uses them in
an $O(VE)$-time algorithm that solves the timing verification problem. These conditions are
also used by the sensitivity analysis algorithms that we present in Section 2.4. Specifically,
we present an $O(VE)$-time algorithm that solves the noncritical sensitivity analysis
problem for a single combinational block, and an $O(VE + V^2 \lg V)$-time algorithm that solves
the noncritical analysis problem for all combinational blocks. We also give an algorithm
that solves the critical analysis problem for a single combinational block in $O(VE)$ time.
By viewing the simplified necessary and sufficient conditions as a two-dimensional linear
program, Section 2.5 shows how the tuning problem can be solved in $O(VE)$ time.

Sections 2.6 and 2.7 describe how to solve the retiming problem for symmetric and
general clocking schemes, respectively. The $O(VE + V^2 \lg V)$-time algorithm for symmetric
clocking schemes is based on reducing the retiming problem to an efficiently solvable
mixed-integer linear program. The $O(V^3)$-time algorithm for the general case uses a technique
we call integer monotonic programming. Both algorithms also determine when the given
clocking scheme cannot be achieved by any retiming. Section 2.8 presents an $O(V^3 \lg V)$-time
algorithm that solves the retiming problem for minimum latch count.

Section 2.9 presents approximation algorithms for the three problems related to retiming
and tuning. Specifically, we give an $O((VE + V^2 \lg V) \lg(V/\epsilon))$-time algorithm for the
retiming and symmetric tuning problem, and an $O(V^3 \lg(V/\epsilon))$-time algorithm for the
retiming and fixed-duty-ratio tuning problem, that achieve the optimal clock period to within
any given relative error $\epsilon > 0$. These two algorithms can also be used to obtain the exact
optimum in the special case where all propagation delays are integers. For the general
retiming and tuning problem, we give an $O(V^3 (1/\epsilon) \lg(1/\epsilon) + (VE + V^2 \lg V) \lg(V/\epsilon))$-time
approximation algorithm.

Although we solve the three problems related to both retiming and tuning with efficient 
approximation algorithms, we have also discovered polynomial-time optimal algorithms for 
them. In Section 2.10 we describe an algorithm that solves the retiming and symmetric
tuning problem optimally in $O(V^2 E)$ time, the retiming and fixed-duty-ratio tuning problem
optimally in $O(V^2 E + V^3 \lg V)$ time, and the general retiming and tuning problem optimally
in $O(V^{11})$ time. These results are based on a characterization of the feasible clock periods
of retimed level-clocked circuits. 

Section 2.11 extends many of our techniques to circuits that employ more than two 
phases. The algorithms for $k$-phase circuits are generally at most a factor of $k$ slower than
the corresponding algorithms for two-phase circuits. Section 2.12 concludes by discussing
how our algorithms can be generalized to handle design issues that arise in practice.
Appendix A.1 provides a proof that the conditions we give for the proper timing of a two-phase,
level-clocked circuit are correct.

In independent work, Lockyear and Ebeling [32] have also obtained algorithms for
timing multiphase, level-clocked circuits. Their results include a polynomial-time algorithm 
for the symmetric retiming problem. They use this algorithm as a subroutine to solve the 
retiming and symmetric tuning problem. They also determine a set of constraints for the 
retiming problem, and they describe a Bellman-Ford-like algorithm for solving the con- 
straints. Algorithms for retiming single-phase, level-clocked circuitry have appeared in [52]. 
Retiming heuristics were given in [3]. 

Early versions of our work appear in [21, 46]. 

2.2 Constraints for Proper Timing 

In this section we give necessary and sufficient conditions for a two-phase, level-clocked 
circuit to be properly timed by a given clocking scheme. The section begins with a formal 
definition of the set of level-clocked circuits to which our results can be applied. We then 
precisely characterize the timing constraints that need to be satisfied by a properly timed 
circuit. These constraints are based on the general formulation from [20], but they are 
substantially simpler due to the additional structure inherent in two-phase, level-clocked 
circuits. 

Since we represent circuits in terms of graphs, we first define some graph notation. For
a directed graph $G = (V, E)$, we denote an edge $e$ from a vertex $u$ to a vertex $v$ by $u \xrightarrow{e} v$.
If the edge name is unnecessary, we sometimes omit it. A path $p$ from $u$ to $v$ is denoted
by $u \stackrel{p}{\leadsto} v$. A path contains both its endpoints $u$ and $v$. We shall use both edge and vertex
weights. For an edge-weight function $w$ and path $p$, the weight $w(p)$ is the sum of the weights of $p$'s
constituent edges. For a vertex-weight function $d$, the weight $d(p)$ is the sum of the weights of $p$'s
constituent vertices, including the weights of its endpoints. A cycle is a path that begins and ends with
the same vertex $v$. For a vertex-weight function $d$, the weight $d(c)$ of a cycle $c$ includes the weight
of each vertex on the cycle only once.
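
The weight conventions above can be made concrete with a short Python sketch. The function names are hypothetical, and for brevity the sketch assumes at most one edge per ordered vertex pair, whereas the circuit model in this chapter is a multigraph.

```python
def path_edge_weight(path, w):
    """w(p): sum of the edge weights along a path given as a list of vertices."""
    return sum(w[(u, v)] for u, v in zip(path, path[1:]))


def path_vertex_weight(path, d):
    """d(p): sum of the vertex weights on the path, both endpoints included."""
    return sum(d[v] for v in path)


def cycle_vertex_weight(cycle, d):
    """d(c) for a cycle listed as [v0, ..., vk, v0]: the shared endpoint counts once."""
    return sum(d[v] for v in cycle[:-1])


# Illustrative data, not taken from any figure in the thesis.
d = {"A": 3, "B": 5, "C": 2}
w = {("A", "B"): 1, ("B", "C"): 0, ("C", "A"): 1}

print(path_edge_weight(["A", "B", "C"], w), path_vertex_weight(["A", "B", "C"], d))
print(path_edge_weight(["A", "B", "C", "A"], w), cycle_vertex_weight(["A", "B", "C", "A"], d))
```

Listing a cycle with its start vertex repeated at the end keeps the edge sum simple, and dropping the last entry implements the count-each-vertex-once rule for $d(c)$.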

We formally represent a two-phase, level-clocked circuit as a directed multigraph $G =
(V, E, d, w, \chi)$, where $V$ is a collection of functional elements, $E$ is the set of interconnections
between functional elements, the (propagation) delay function $d$ is a mapping from $V$ to the
nonnegative real numbers, the edge-weight function $w$ is a mapping from $E$ to the integers,
and $\chi : V \to \{0, 1\}$ is an assignment of a phase to each functional element. For a two-phase
circuit $G$ to be well formed, we require

WF1. $w(e) \ge 0$ for all $e \in E$;

WF2. $w(c) > 0$ for every cycle $c \in G$;

WF3. for every edge $u \xrightarrow{e} v \in E$,

$$w(e) - \chi(u) + \chi(v) \equiv 0 \pmod 2 .$$

(The weight of a cycle, or of a path, is just the sum of the weights of its constituent edges.)
In our circuit model, a vertex $v \in V$ corresponds to a functional element whose maximum
propagation delay is equal to $d(v)$ and whose minimum propagation delay is zero.
Level-clocked latches are not represented explicitly but rather, are represented implicitly by the
weights $w(e)$ of edges $u \xrightarrow{e} v \in E$. A value $w(e) = 0$ indicates that a direct connection exists
between the output of the functional element represented by $u$ and an input of the functional
element represented by $v$. A value $w(e) > 0$ indicates that the connection between the two
functional elements consists of $w(e)$ level-clocked latches connected in series.

Figure 2-3: The rise-to-fall time for the path $A \leadsto E$ is $\phi_0 + 2(\gamma_0 + \phi_1 + \gamma_1 + \phi_0) = \phi_0 + 2\pi$. The
rise-to-fall time for the path $B \leadsto C$ is $\phi_1 + \gamma_1 + \phi_0$. Each of the squiggly lines in the top part of
the figure represents a path of zero or greater combinational delay.

The requirement that $w(c) > 0$ precludes us from considering circuits with unclocked
state. Since we represent latches implicitly in terms of weights on edges between functional
elements, the phase that clocks a given latch in the circuit modeled by $G$ is determined implicitly
as well. If the phase of a vertex $v$ is $\chi(v)$, it means that $\phi_{\chi(v)}$ clocks the last latch on any
edge that ends at $v$. Condition WF3, which by induction generalizes to

WF3'. for every path $u \stackrel{p}{\leadsto} v$,

$$w(p) - \chi(u) + \chi(v) \equiv 0 \pmod 2 ,$$

ensures that the latches on any path alternate between $\phi_0$ and $\phi_1$. Since any cycle $c$ is
also a path, its weight must be even, in addition to being positive. Hence, every cycle $c$
in a two-phase circuit must contain at least two latches, one controlled by each of the two
phases.
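
The three well-formedness conditions can be checked mechanically. The Python sketch below is one possible check, not a routine from Tim; the function name and the input format (a list of `(u, v, w)` triples with integer weights, so that parallel edges are allowed) are assumptions made for illustration. It relies on the fact that, once WF1 holds, a cycle of weight zero exists exactly when the zero-weight edges alone contain a cycle.

```python
from collections import defaultdict, deque


def is_well_formed(vertices, edges, chi):
    """Check WF1-WF3 for edges given as (u, v, w) triples; chi maps vertices to 0 or 1."""
    # WF1: no edge carries a negative number of latches.
    if any(w < 0 for _, _, w in edges):
        return False
    # WF3: the parity of w(e) must match the phases at the two endpoints.
    if any((w - chi[u] + chi[v]) % 2 != 0 for u, v, w in edges):
        return False
    # WF2: with WF1 in force, every cycle has positive weight exactly when the
    # subgraph of zero-weight edges is acyclic (tested with Kahn's algorithm).
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for u, v, w in edges:
        if w == 0:
            successors[u].append(v)
            indegree[v] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return processed == len(vertices)


# Two functional elements joined in a ring with one latch on each wire.
print(is_well_formed(["a", "b"], [("a", "b", 1), ("b", "a", 1)], {"a": 0, "b": 1}))
```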

We now turn to the issue of proper timing. Intuitively, a level-clocked circuit is properly 
timed if whenever a latch holds a value (its clock input is low), it holds the same value as in 
an identical circuit in which all functional elements have zero propagation delay. This notion 
of proper timing is "structural", in the sense that we require that a circuit operate correctly
regardless of the functions computed by the functional elements. This requirement avoids 
potential difficulties with computational intractability. The semantics of proper timing are 
studied in [19]. Ishii and Leiserson [20] give a set of "A-constraints" that serve as necessary 
and sufficient conditions for the proper timing of a general class of level-clocked circuits. 
For two-phase circuits, this general formulation reduces to a much simpler set of necessary 
and sufficient conditions. 

The conditions for proper timing of a circuit $G = (V, E, d, w, \chi)$ are based on considering
the operation of $G$ when all propagation delays are 0. Let $\sigma$ be any path, not necessarily
simple, from a latch $A$ to a latch $B$ in the circuit that $G$ represents. The rise-to-fall time
$\tau(\sigma)$ of a path $\sigma$ is the time it would take in the circuit for the effect of a rising edge of $A$'s
phase to propagate along $\sigma$ and be stored in $B$ by the falling edge of $B$'s phase. For example,
the rise-to-fall time of path $A \leadsto E$ in Figure 2-3 is $\phi_0 + 2(\gamma_0 + \phi_1 + \gamma_1 + \phi_0) = \phi_0 + 2\pi$. The
rise-to-fall time of path $B \leadsto C$ is $\phi_1 + \gamma_1 + \phi_0$. The following proposition uses rise-to-fall
times to give conditions for the proper timing of a circuit. Its proof is given in Appendix A.1.

Proposition 13 A two-phase, level-clocked circuit is properly timed if and only if for all 
latches A and B in the circuit, the propagation delay along any path from A to B is no 
greater than the rise-to-fall time of the path. □ 

Given Proposition 13, we can formulate a set of conditions for the proper timing of a 
two-phase, level-clocked circuit. Specifically, we have the following lemma. 

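Lemma 14 Let $G = (V, E, d, w, \chi)$ be a circuit that employs a clocking scheme $\pi =
(\phi_0, \gamma_0, \phi_1, \gamma_1)$. Then $G$ is properly timed by $\pi$ if and only if for every path $u \stackrel{p}{\leadsto} v$ in
$G$, we have

$$d(p) \le \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} \tag{2.2}$$

if $\chi(u) \ne \chi(v)$, and

$$d(p) \le \pi \left( \frac{2 + w(p)}{2} \right) - \gamma_{1-\chi(v)} \tag{2.3}$$

if $\chi(u) = \chi(v)$.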

Proof. In order to prove the lemma, we shall make the straightforward extension of our
graph notation from our simple model, in which latches are represented implicitly by edge
weights, to the underlying circuit, in which latches are represented explicitly.

($\Rightarrow$) The necessity of Inequalities (2.2) and (2.3) follows from Proposition 13. Consider
any path $u \stackrel{p}{\leadsto} v$ in $G$, and extend it at both ends to produce a path $\sigma = A \leadsto u \stackrel{p}{\leadsto} v \leadsto B$ in
the circuit modeled by $G$, where $A$ and $B$ are latches and the subpaths $A \leadsto u$ and $v \leadsto B$
are latch free. Thus, $A$ is clocked by $\phi_{\chi(u)}$ and $B$ is clocked by $\phi_{1-\chi(v)}$. By definition, the
number of latches on the path $\sigma$ is $w(p) + 2$, and the total propagation delay along $\sigma$ is
$d(\sigma) \ge d(p)$. A case analysis now yields the desired result.

If $\chi(u) \ne \chi(v)$, then $A$ and $B$ are both clocked by phase $\phi_{\chi(u)} = \phi_{1-\chi(v)}$ and $w(p)$ is an
odd number. Thus, the effect of a rising edge on $A$'s clock input takes $(1 + w(p))/2$ periods
to reach $B$, plus an additional $\phi_{\chi(u)}$ for the time until the falling edge on $B$'s clock input.
Hence, we have

$$\tau(\sigma) = \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{\chi(u)}
             = \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} . \tag{2.4}$$

The propagation delay along $\sigma$ is at least $d(p)$, since $\sigma$ contains $p$ as a subpath. Hence, by
Proposition 13, if the circuit is properly timed, we have $d(p) \le \tau(\sigma)$ and Inequality (2.2)
holds.

If $\chi(u) = \chi(v)$, then latch $A$ is clocked by phase $\phi_{\chi(u)}$, latch $B$ is clocked by $\phi_{1-\chi(u)}$,
and $w(p)$ is an even number. Thus, the effect of a rising edge on $A$'s clock input takes
$(2 + w(p))/2$ periods to reach $B$, minus the gap $\gamma_{1-\chi(v)}$ following the falling edge on $B$'s
clock input. Hence, we have

$$\tau(\sigma) = \pi \left( \frac{2 + w(p)}{2} \right) - \gamma_{1-\chi(v)} . \tag{2.5}$$

The propagation delay along $\sigma$ is at least $d(p)$, since $\sigma$ contains $p$ as a subpath. Hence, by
Proposition 13, if the circuit is properly timed, we have $d(p) \le \tau(\sigma)$ and Inequality (2.3)
holds.

($\Leftarrow$) To prove the sufficiency of Inequalities (2.2) and (2.3), assume that the circuit is
not properly timed. Then Proposition 13 implies that there exists a latch-to-latch path
$A \stackrel{\sigma}{\leadsto} B$ in the circuit with propagation delay greater than $\tau(\sigma)$. Without loss of generality,
$\sigma$ has the minimum rise-to-fall time of any such path, and hence, latch $A$ is directly followed
by a functional element $u$ and latch $B$ is directly preceded by a functional element $v$. Thus,
$\sigma = A \to u \stackrel{p}{\leadsto} v \to B$, and we have $d(p) > \tau(\sigma)$. Using a case analysis similar to the
one to prove necessity, we can conclude that either Inequality (2.2) or Inequality (2.3) is
violated. □

The reader should note that the lemma requires that all paths in the circuit be consid- 
ered, not just simple ones. 
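
The right-hand sides of Inequalities (2.2) and (2.3) depend only on $w(p)$, $\chi(u)$, $\chi(v)$, and the clocking scheme, so they can be tabulated directly. The Python sketch below is illustrative only: the function name and the tuple encoding of the scheme are assumptions, and the numeric scheme is made up. With $w(p) = 3$ and $w(p) = 0$ the two calls produce bounds of the same form as the two rise-to-fall times shown in Figure 2-3.

```python
def max_allowed_delay(w_p, chi_u, chi_v, scheme):
    """Bound on d(p) from Lemma 14 for a path from u to v with w(p) = w_p.

    scheme is the tuple (phi0, gamma0, phi1, gamma1).
    """
    phi0, gamma0, phi1, gamma1 = scheme
    period = phi0 + gamma0 + phi1 + gamma1
    # WF3' ties the parity of w(p) to the phases of the endpoints.
    if (w_p - chi_u + chi_v) % 2 != 0:
        raise ValueError("w(p), chi(u), chi(v) are inconsistent with WF3'")
    if chi_u != chi_v:
        # Inequality (2.2): whole periods plus the duty cycle of phase 1 - chi(v).
        return period * (1 + w_p) / 2 + (phi0, phi1)[1 - chi_v]
    # Inequality (2.3): whole periods minus the gap of phase 1 - chi(v).
    return period * (2 + w_p) / 2 - (gamma0, gamma1)[1 - chi_v]


scheme = (4, 1, 3, 2)  # illustrative values; the period is 10
print(max_allowed_delay(3, 0, 1, scheme))  # has the form phi0 + 2 * period
print(max_allowed_delay(0, 1, 1, scheme))  # has the form phi1 + gamma1 + phi0
```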

2.3 Verifying Proper Circuit Timing 

In this section we consider the verification problem for two-phase, level-clocked circuits:
Given a two-phase, level-clocked circuit $G = (V, E, d, w, \chi)$ and a clocking scheme $\pi =
(\phi_0, \gamma_0, \phi_1, \gamma_1)$, determine whether $G$ is properly timed by $\pi$. Figure 2-4 gives an $O(VE)$-time
algorithm TV for the timing verification problem. This bound improves the $O(E^2)$
bound that one obtains by applying the general verification algorithm in [20] to two-phase
circuits. We first bound the running time of Algorithm TV and then prove its correctness.

Lemma 15 Algorithm TV terminates in 0(VE) time. 

Proof. The circuit transformation of Step 1 can be computed in $O(E)$ time. We can
implement Step 2 to run in $O(VE)$ time as follows. First, we compute in $O(E)$ time a
topological sort [5, Section 23.4] of all edges $e \in E$ with $w(e) = 0$. We then execute a
doubly nested loop, where the outer loop is indexed by $i$, and the inner loop is indexed
by each $e \in E$ consistent with the topological sort order if $w(e) = 0$ and in any order if
$w(e) > 0$. Within the doubly nested loop, we maintain $D(v, i)$ as a running maximum of the
right-hand side of the equation in Step 2 for each $v$, where $u \xrightarrow{e} v$. The order of execution
guarantees that all right-hand side terms are computed before they are used. Since $i$ takes
on $O(V)$ values and $e$ takes on $|E|$ values, the entire doubly nested loop runs in $O(VE)$
time. Step 3 checks $O(V^2)$ constraints in $O(V^2)$ time. Finally, Step 4 can be performed in
$O(VE)$ time using the Bellman-Ford algorithm [5, Section 25.3]. Thus, the total running
time of Algorithm TV is $O(VE)$. □






TV($G$, $\pi$)

1. Modify $G$ by replacing $w(e)$ on each edge $e \in E$ with $(w(e) \bmod 2) + 2$ if $w(e) \ge 4$.

2. Compute $D(v, i)$ for all $v \in V$ and $i = 0, 1, 2, \ldots, 3|V| - 3$ from the recurrence

   $$D(v, i) = d(v) + \max \{ D(u, i - w(e)) : u \xrightarrow{e} v \text{ and } i \ge w(e) \} .$$

3. Check the following constraints for each vertex $v \in V$ and $i = 0, 1, \ldots, 3|V| - 3$:

   $$D(v, i) \le \begin{cases}
   \pi \left( \dfrac{1 + i}{2} \right) + \phi_{1-\chi(v)} & \text{if } i \text{ is odd;} \\[2ex]
   \pi \left( \dfrac{2 + i}{2} \right) - \gamma_{1-\chi(v)} & \text{if } i \text{ is even.}
   \end{cases}$$

   If any constraint is violated, then return no.

4. If the graph $G$ with weight $w(e) - 2d(u)/\pi$ on each edge $u \xrightarrow{e} v$ has a negative-weight
   cycle, then return no; otherwise, return yes.

Figure 2-4: Pseudocode for Algorithm TV, which tests whether a circuit $G = (V, E, d, w, \chi)$ is
properly timed under a clocking scheme $\pi = (\phi_0, \gamma_0, \phi_1, \gamma_1)$. The algorithm returns yes if the circuit
is properly timed and no otherwise.
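
The four steps of Figure 2-4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in Tim: the function names and input format are hypothetical, the circuit is assumed to be well formed, the one-vertex path is taken as the base case $D(v, 0) \ge d(v)$, $D(v, i)$ is set to $-\infty$ when no path of weight $i$ ends at $v$, and the edge weights in Step 4 are multiplied by the period, which leaves the sign of every cycle unchanged while avoiding a division.

```python
from collections import defaultdict, deque


def rhs_bound(i, chi_v, scheme):
    """Right-hand side of the Step 3 constraint for weight i and phase chi(v)."""
    phi0, gamma0, phi1, gamma1 = scheme
    period = phi0 + gamma0 + phi1 + gamma1
    if i % 2 == 1:
        return period * (1 + i) / 2 + (phi0, phi1)[1 - chi_v]
    return period * (2 + i) / 2 - (gamma0, gamma1)[1 - chi_v]


def timing_verify(vertices, edges, d, chi, scheme):
    """Return True if the circuit passes Steps 1-4; edges are (u, v, w) triples."""
    period = sum(scheme)
    # Step 1: cap every latch count at 3 while preserving its parity.
    edges = [(u, v, (w % 2) + 2 if w >= 4 else w) for u, v, w in edges]

    # Order the vertices topologically along zero-weight edges (Kahn's algorithm).
    indegree = {v: 0 for v in vertices}
    zero_succ = defaultdict(list)
    incoming = defaultdict(list)
    for u, v, w in edges:
        incoming[v].append((u, w))
        if w == 0:
            zero_succ[u].append(v)
            indegree[v] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in zero_succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("zero-weight cycle: circuit is not well formed")

    # Steps 2 and 3: D[i][v] is the largest delay of a path of weight i ending at v.
    D = []
    for i in range(3 * len(vertices) - 2):
        D.append({})
        for v in order:
            # Assumed base case: the one-vertex path has weight 0 and delay d(v).
            best = 0 if i == 0 else float("-inf")
            for u, w in incoming[v]:
                if w <= i:
                    best = max(best, D[i - w][u])
            D[i][v] = d[v] + best
            if D[i][v] > rhs_bound(i, chi[v], scheme):
                return False

    # Step 4: Bellman-Ford negative-cycle test on weights period * w(e) - 2 * d(u).
    dist = {v: 0 for v in vertices}
    for _ in range(len(vertices)):
        changed = False
        for u, v, w in edges:
            if dist[u] + period * w - 2 * d[u] < dist[v]:
                dist[v] = dist[u] + period * w - 2 * d[u]
                changed = True
        if not changed:
            return True
    return False


# A two-element ring with one latch per wire (illustrative data).
ring = [("a", "b", 1), ("b", "a", 1)]
print(timing_verify(["a", "b"], ring, {"a": 6, "b": 8}, {"a": 0, "b": 1}, (6, 1, 6, 1)))
```

With floating-point delays the comparisons in Steps 3 and 4 would need a small tolerance; the example uses integers so that the boundary cases are exact.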



We prove the correctness of Algorithm TV in three lemmas. 

We first show that the transformation of G in Step 1 yields a new circuit G' with at 
most 3 latches per wire and such that G is properly timed if and only if G' is properly 
timed. The new circuit G' does not in general compute the same function as G. The reason 
for performing the transformation is that it allows us to use an upper bound of $3|V| - 3$
on index i in Steps 2 and 3. Without this transformation, the algorithm could be made to 
work by letting i range as high as the number of latches on the longest simple path in the 
circuit. Performing this transformation results in a more efficient verification algorithm. 

Lemma 16 Let $G = (V, E, d, w, \chi)$ be a circuit, let $\hat{e} \in E$ be an edge with $w(\hat{e}) \ge 4$, and
let $G' = (V, E, d, w', \chi)$ be the circuit obtained by replacing the $w(e)$ latches on each edge
$e \in E$ with $w'(e)$ latches, where

$$w'(e) = \begin{cases}
w(e) - 2 & \text{if } e = \hat{e}, \\
w(e) & \text{otherwise.}
\end{cases}$$

Then for any clocking scheme $\pi = (\phi_0, \gamma_0, \phi_1, \gamma_1)$, $G$ is properly timed by $\pi$ if and only if $G'$
is properly timed by $\pi$.

Proof. To begin, we argue that $G'$ is a well-formed two-phase circuit satisfying Conditions
WF1, WF2, and WF3. First, we have $w'(\hat{e}) \ge 0$, since $w(\hat{e}) \ge 4$. Second, the weight of any
cycle remains positive because $w'(\hat{e}) > 0$. Finally, Condition WF3 continues to hold because

$$w'(e) \equiv w(e) \pmod 2 .$$

(<=) We now show that if G' is properly timed by it, then so is G. Assume G' is properly 
timed. By Lemma 14, Inequalities (2.2) and (2.3) hold for the weight function w'. But since 
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w(e) > w'(e) for all e G E, the right-hand sides of these inequalities remain the same or 
are larger for the weight function w. Since the left-hand sides are the same, the inequalities 
must hold for circuit G. 

($\Rightarrow$) To prove the converse, we first prove the following claim. Let
$p = u_1 \stackrel{p_1}{\leadsto} v_1 \stackrel{\hat{e}}{\to} u_2 \stackrel{p_2}{\leadsto} v_2$
be a path in a circuit that goes through an edge $\hat{e} \in E$ such that $w(\hat{e}) \ge 2$, and suppose
that $p_1$ and $p_2$ both satisfy Inequalities (2.2) and (2.3). Then, we claim that $p$ satisfies both
inequalities.

The proof of the claim is a case analysis that depends on all possible assignments of
phases to $u_1$, $v_1$, $u_2$, and $v_2$. Let us consider, for example, the situation where $\chi(u_1) =
\chi(u_2) = \chi(v_2) = 0$ and $\chi(v_1) = 1$; the other cases are similar. From Lemma 14, we have

\[ d(p_1) \le \left( \frac{2 + w(p_1)}{2} \right) \pi - \gamma_1 , \]

\[ d(p_2) \le \left( \frac{1 + w(p_2)}{2} \right) \pi + \phi_1 . \]

Adding these inequalities and using the fact that $w(\hat{e}) \ge 2$, we obtain

\begin{align*}
d(p) &= d(p_1) + d(p_2) \\
&\le \left( \frac{w(p_1) + w(p_2) + 3}{2} \right) \pi + \phi_1 - \gamma_1 \\
&\le \left( \frac{w(p_1) + w(\hat{e}) + w(p_2) + 1}{2} \right) \pi + \phi_1 - \gamma_1 \\
&= \left( \frac{1 + w(p)}{2} \right) \pi + \phi_1 - \gamma_1 ,
\end{align*}

thus proving the claim.

Suppose, now, that $G$ is properly timed by $\pi$, but that $G'$ is not. Then there must exist
a path in $G'$ that violates Inequality (2.2) or Inequality (2.3) with respect to the weight
function $w'$ but not with respect to the weight function $w$. Let $p$ be such a path with
the fewest edges. Path $p$ must pass through the edge $\hat{e}$ whose latch count was reduced, since
otherwise the inequalities for $w$ and $w'$ are identical. But then, since $w'(\hat{e}) \ge 2$, the claim
we have just proved applies to $p$. Consequently, a subpath of $p$ must violate one of the two
inequalities with respect to $w'$, which contradicts the supposition that $p$ is the shortest such
path. This contradiction completes the proof. □

Corollary 17 Let $G = \langle V, E, d, w, \chi \rangle$ be a circuit, and let $G' = \langle V, E, d, w', \chi \rangle$ be the circuit
obtained by replacing the $w(e)$ latches on each edge $e \in E$ with $w'(e) \le 3$ latches, where

\[ w'(e) = \begin{cases} w(e) & \text{if } w(e) \le 3 , \\ (w(e) \bmod 2) + 2 & \text{if } w(e) \ge 4 . \end{cases} \]

Then for any clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, $G$ is properly timed by $\pi$ if and only if $G'$
is properly timed by $\pi$.

Proof. For any edge $e \in E$ for which $w(e) \ge 4$, we have $(w(e) \bmod 2) + 2 = w(e) - 2c$,
where $c$ is the largest integer such that $w(e) - 2c \ge 2$. Thus, by Lemma 16, we can apply
the transformation described in the lemma $c$ times to reduce the number of latches on any
given edge $e$ for which $w(e) \ge 4$ to $(w(e) \bmod 2) + 2 \le 3$. Moreover, we can do the same
for all such edges. □

The following lemma is used to justify Steps 2 and 3 of Algorithm TV. It shows that 
the infinitely many path constraints (2.2) and (2.3) describing the conditions for proper 
circuit timing are equivalent to a finite set of inequalities corresponding to simple paths 
and simple cycles in the circuit. The cycle inequalities motivate the computation of the 
maximum delay-to-latch ratio R(G) in the algorithm, while the path inequalities motivate 
the computation of the various D(v,i) values. 

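Lemma 18 Let $G = \langle V,E,d,w,\chi \rangle$ be a circuit that employs a clocking scheme $\pi =
\langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$. Then $G$ is properly timed by $\pi$ if and only if for every simple path $u \stackrel{p}{\leadsto} v$,
we have

\[ d(p) \le \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} \tag{2.6} \]

if $\chi(u) \ne \chi(v)$, and

\[ d(p) \le \pi \left( \frac{2 + w(p)}{2} \right) - \gamma_{1-\chi(v)} \tag{2.7} \]

if $\chi(u) = \chi(v)$; and for every simple cycle $c$, we have

\[ d(c) \le \pi \left( \frac{w(c)}{2} \right) . \tag{2.8} \]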

Proof. We show that the inequalities given in Lemma 14 are equivalent to Inequalities (2.6),
(2.7), and (2.8).

We first show that these three inequalities follow from Inequalities (2.2) and (2.3) from
Lemma 14. Since any simple path is a path, Inequalities (2.6) and (2.7) follow immediately
from Inequalities (2.2) and (2.3). Now, suppose that the remaining Inequality (2.8) is
violated for some cycle $c$. Let $u \stackrel{e}{\to} v$ be an edge on $c$, and consider a path that goes from $u$
to $v$ along $e$ and then $k$ times around the cycle $c$, finally ending at $v$, for some $k \ge 0$. Assume
that $\chi(u) \ne \chi(v)$; the situation when $\chi(u) = \chi(v)$ is similar. Then by Inequality (2.2), we
have

\[ d(u) + k \cdot d(c) + d(v) \le \pi \left( \frac{1 + w(e) + k \cdot w(c)}{2} \right) + \phi_{1-\chi(v)} . \]

Rewriting this inequality, we obtain

\[ k \left( d(c) - \pi \frac{w(c)}{2} \right) \le \pi \left( \frac{1 + w(e)}{2} \right) + \phi_{1-\chi(v)} - d(u) - d(v) . \tag{2.9} \]

But, since Inequality (2.8) is violated for $c$, we have $d(c) > \pi (w(c)/2)$, which means that
the left-hand side of Inequality (2.9) can be made to exceed the right-hand side by choosing
$k$ sufficiently large. This contradiction proves that Inequality (2.8) holds.

We now prove that Inequalities (2.6), (2.7), and (2.8) imply Inequalities (2.2) and (2.3).
Suppose Inequalities (2.6), (2.7), and (2.8) hold, and let $u \stackrel{p}{\leadsto} v$ be a path with the minimum
number of edges that violates Inequality (2.2). (The proof for Inequality (2.3) is similar.)
Thus, we have

\[ d(p) > \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} . \]

The path $p$ cannot be simple, because then it would violate Inequality (2.6) or Inequality
(2.7), and thus it must contain a simple cycle $c$.

Consider the path $p'$ derived from $p$ by removing the cycle $c$. The propagation delay of
$p'$ is $d(p') = d(p) - d(c)$, and the number of latches on $p'$ is $w(p') = w(p) - w(c)$. Using
Inequality (2.8), we have

\begin{align*}
d(p') &= d(p) - d(c) \\
&> \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} - d(c) \\
&\ge \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} - \pi \left( \frac{w(c)}{2} \right) \\
&= \pi \left( \frac{1 + w(p) - w(c)}{2} \right) + \phi_{1-\chi(v)} \\
&= \pi \left( \frac{1 + w(p')}{2} \right) + \phi_{1-\chi(v)} .
\end{align*}

Thus, $p'$ also violates Inequality (2.2), and $p'$ has fewer edges than $p$, contradicting the
minimality of $p$. □

Lemma 18 shows that the infinitely many constraints of Lemma 14 can be summarized
by a finite set of constraints, but in general, the number of these constraints can still be
exponential in the size (number of vertices and edges) of the circuit. The path delays $D(v,i)$
computed in Step 2 of Algorithm TV allow us to further reduce the number of constraints
to $O(V^2)$, as is shown by the following lemma.

Lemma 19 Let $G = \langle V,E,d,w,\chi \rangle$ be a circuit that employs a clocking scheme $\pi =
\langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, and let $W$ be an upper bound on the maximum number of latches on any
simple path in $G$. For $v \in V$ and $i = 0, 1, \ldots, W$, let

\[ D(v, i) = \max \{ d(p) : u \stackrel{p}{\leadsto} v \text{ and } w(p) = i \} . \tag{2.10} \]

Then $G$ is properly timed by $\pi$ if and only if for every vertex $v \in V$ and $i = 0, 1, \ldots, W$, we
have

\[ D(v,i) \le \pi \left( \frac{1 + i}{2} \right) + \phi_{1-\chi(v)} \tag{2.11} \]

if $i$ is odd, and

\[ D(v,i) \le \pi \left( \frac{2 + i}{2} \right) - \gamma_{1-\chi(v)} \tag{2.12} \]

if $i$ is even; and for every simple cycle $c$, Inequality (2.8),

\[ d(c) \le \pi \left( \frac{w(c)}{2} \right) , \]

holds.

Proof. ($\Rightarrow$) First, we assume that $G$ is properly timed and show that Inequalities (2.11),
(2.12), and (2.8) are satisfied. Let $u \stackrel{p}{\leadsto} v$ be a path with $w(p) = i$ that attains the maximum
in Equation (2.10). If $i$ is odd, then by Condition WF3', we have $\chi(u) \ne \chi(v)$,
and thus by Inequality (2.2), we have

\begin{align*}
D(v,i) &= d(p) \\
&\le \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} \\
&= \pi \left( \frac{1 + i}{2} \right) + \phi_{1-\chi(v)} ,
\end{align*}

and hence Inequality (2.11) holds. Similarly, if $i$ is even, we can prove that Inequality (2.12)
holds. Lemma 18 states that Inequality (2.8) holds, and thus all three
inequalities are satisfied.

($\Leftarrow$) To show the other direction of the lemma, assume that Inequalities (2.11), (2.12),
and (2.8) are satisfied. We must prove that $G$ is properly timed. Let $u \stackrel{p}{\leadsto} v$ be any simple
path in $G$ with $w(p)$ latches. By the definition (2.10), we have $d(p) \le D(v,w(p)) = D(v,i)$
for some $i \le W$, since $p$ is simple. If $\chi(u) \ne \chi(v)$, then by Condition WF3', the value $i$ is
odd, which means

\begin{align*}
d(p) &\le D(v,i) \\
&\le \pi \left( \frac{1 + i}{2} \right) + \phi_{1-\chi(v)} \\
&= \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{1-\chi(v)} ,
\end{align*}

and hence Inequality (2.6) holds. Similarly, if $\chi(u) = \chi(v)$, we can conclude that
Inequality (2.7) holds. Since Inequality (2.8) holds by assumption, by Lemma 18, $G$ is properly
timed by $\pi$. □

We are now able to prove the correctness of Algorithm TV. 

Theorem 20 Algorithm TV solves the timing verification problem for a two-phase,
level-clocked circuit $G = \langle V,E,d,w,\chi \rangle$ and a clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ in $O(VE)$
time.

Proof. Corollary 17 shows that it suffices to verify the circuit as modified by Step 1 of
Algorithm TV. Moreover, the transformation results in at most 3 latches per edge of the
circuit, and hence, the number of latches on any simple path in the circuit is at most
$3|V| - 3$. A simple inductive argument shows that the recurrence in Step 2 computes the
values $D(v,i)$ as defined by Equation (2.10), if we choose $W = 3|V| - 3$.

The remainder of the algorithm tests whether the constraints for proper timing from
Lemma 19 are satisfied. Inequalities (2.11) and (2.12) are checked directly by Step 3.
Inequality (2.8) is checked indirectly by Step 4, which can be seen as follows. Sum the edge
weights around a cycle. If the cycle has a negative weight, then Inequality (2.8) is violated.
Conversely, if there is no negative-weight cycle, then every cycle in the graph satisfies the
inequality.

The running time bound follows from Lemma 15. □

NCSA($G$, $\pi$, $u$)

1 Compute single-source shortest-paths lengths $l(v)$ from source $u$ on graph $G$ with
  edge-length $w'(e) = \lfloor w(e)/2 \rfloor \pi + (w(e) \bmod 2)(\gamma_{\chi(x)} + \phi_{1-\chi(x)}) - d(y)$ for each edge $x \stackrel{e}{\to} y \in E$.

2 Compute single-destination shortest-paths lengths $l'(v)$ to sink $u$ on graph $G$ with
  edge-length $w'(e) = \lfloor w(e)/2 \rfloor \pi + (w(e) \bmod 2)(\gamma_{\chi(x)} + \phi_{1-\chi(x)}) - d(y)$ for each edge $x \stackrel{e}{\to} y \in E$.

3 $\delta(u) \leftarrow \min \{ l(v) + w'(e) : v \stackrel{e}{\to} u \in E \}$

4 $\delta(u) \leftarrow \min \{ \delta(u), \min_{v \in V} \{ \phi_{\chi(v)} - d(v) + l'(v) \} + \min_{v \in V} \{ l(v) + \gamma_{\chi(v)} + \phi_{1-\chi(v)} \} \}$

5 if $\delta(u) < 0$

6 then return fail

7 else return $\delta(u)$.

Figure 2-5: Algorithm NCSA for the noncritical sensitivity analysis problem. The algorithm
takes as input a two-phase circuit $G$, a clocking scheme $\pi$, and a vertex $u$. It produces as output the
maximum increase $\delta(u)$ in the delay $d(u)$ that will not affect the proper timing of $G$ by the clocking
scheme $\pi$.

2.4 Sensitivity Analysis 

In this section we consider two sensitivity analysis problems for two-phase, level-clocked
circuits. This analysis identifies timing bottlenecks in the circuit as well as parts of the
circuit that have potential for further optimization.

The first problem we consider is the noncritical sensitivity analysis problem: Given a
two-phase, level-clocked circuit $G = \langle V, E, d, w, \chi \rangle$ and a clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$
such that $G$ is properly timed by $\pi$, determine for any given combinational block $u \in V$ the
maximum possible increase $\delta(u)$ in its propagation delay $d(u)$ such that $G$ is still properly
timed by $\pi$.

The second problem we consider is the critical sensitivity analysis problem: Given a
two-phase, level-clocked circuit $G = \langle V, E, d, w, \chi \rangle$ and a clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$,
determine for any given combinational block $u \in V$ the minimum possible $\delta(u)$ such that
$G$ is properly timed by $\pi$ when the propagation delay $d(u)$ decreases by $\delta(u)$.

2.4.1 Noncritical Sensitivity Analysis 

In this section we present two algorithms for the noncritical sensitivity analysis problem.
The first algorithm solves this problem for a single combinational block in the circuit in
$O(VE)$ time. The second algorithm solves the noncritical sensitivity analysis problem for
all combinational blocks in the circuit in $O(VE + V^2 \lg V)$ time.

Algorithm NCSA, shown in Figure 2-5, solves the noncritical sensitivity analysis problem
for a single combinational block $u \in V$ in $O(VE)$ time. The algorithm computes the
maximum increase $\delta(u)$ in the propagation delay $d(u)$ such that Inequalities (2.6), (2.7),
and (2.8) that describe a properly timed circuit are not violated. The slack of
Inequality (2.8) is computed in Step 3 after a single-source shortest-paths computation with source
$u$ and a single-destination shortest-paths computation with sink $u$ on the graph $G$ with
edge-length $w'(e) = \lfloor w(e)/2 \rfloor \pi + (w(e) \bmod 2)(\gamma_{\chi(x)} + \phi_{1-\chi(x)}) - d(y)$ for each edge $x \stackrel{e}{\to} y \in E$.
This edge-length accounts automatically for the slack between the rise-to-fall time and the
propagation delay along any path. The slack of Inequalities (2.6) and (2.7) is computed in
Step 4. In this step, the algorithm computes the simple path $p$ through $u$ with minimum
$r(p) - d(p)$, where $r(p)$ is the rise-to-fall time of $p$. The algorithm returns the maximum
value $\delta(u) \ge 0$, such that for $d(u) \leftarrow d(u) + \delta(u)$, the circuit $G$ is still properly timed by
the given $\pi$. If $\delta(u)$ is negative, then $G$ is not properly timed by $\pi$, and in this case the
algorithm fails.

The following lemma proves a bound on the running time of Algorithm NCSA.

Lemma 21 Algorithm NCSA terminates in $O(VE)$ time.

Proof. The single-source shortest-paths problem in Step 1 and the single-destination
shortest-paths problem in Step 2 can be solved in $O(VE)$ time using the Bellman-Ford algorithm [5].
The minimization in Step 3 requires $O(V)$ time, and the minimization in Step 4 requires
$O(V^2)$ time. Therefore, the total running time of the algorithm is $O(VE)$. □
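Steps 1 and 3 of Algorithm NCSA, which yield the slack of the cycle constraint, can be sketched as below. This is an illustrative sketch only: the function names `edge_length` and `cycle_slack` and the data layout are hypothetical, the period is assumed to be $\pi = \phi_0 + \gamma_0 + \phi_1 + \gamma_1$, and the circuit is assumed to be properly timed, so that the graph has no negative-length cycle and Bellman-Ford relaxation converges.

```python
def edge_length(w, chi_x, d_y, phi, gamma):
    """Length of an edge x -> y with w latches, as used by Steps 1 and 2."""
    pi = phi[0] + gamma[0] + phi[1] + gamma[1]  # assumed form of the period
    return (w // 2) * pi + (w % 2) * (gamma[chi_x] + phi[1 - chi_x]) - d_y


def cycle_slack(vertices, edges, d, chi, phi, gamma, u):
    """Smallest total edge length of a cycle through u (Steps 1 and 3)."""
    lengths = [(x, y, edge_length(w, chi[x], d[y], phi, gamma))
               for (x, y, w) in edges]
    inf = float("inf")
    # Step 1: Bellman-Ford shortest-path lengths l(v) from source u.
    l = {v: inf for v in vertices}
    l[u] = 0
    for _ in range(len(vertices) - 1):
        for x, y, length in lengths:
            if l[x] + length < l[y]:
                l[y] = l[x] + length
    # Step 3: close each path u ~> v with an edge v -> u.
    return min((l[x] + length for x, y, length in lengths if y == u),
               default=inf)
```

For a two-vertex loop with one latch on each edge, unit delays, $\phi_0 = \phi_1 = 4$, and $\gamma_0 = \gamma_1 = 1$, the sketch reports a slack of 8, which agrees with $\pi w(c)/2 - d(c) = 10 - 2$.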

The following lemma proves the correctness of Algorithm NCSA.

Lemma 22 Let $G = \langle V,E,d,w,\chi \rangle$ be a two-phase, level-clocked circuit, and let $\pi =
\langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ be a clocking scheme such that $G$ is properly timed by $\pi$. Then, for any
given combinational block $u \in V$, Algorithm NCSA correctly determines the maximum
possible increase in its propagation delay $d(u)$ such that $G$ is still properly timed by $\pi$.

Proof. According to Lemma 18, the circuit $G$ is properly timed by $\pi$ if and only if
Inequalities (2.6), (2.7), and (2.8) are satisfied. We will show that these inequalities are still
satisfied when $d(u)$ increases by the value $\delta(u)$ that is computed by Algorithm NCSA. We
will also show that at least one of these inequalities is violated for greater values of $\delta(u)$.

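First, let us consider Inequality (2.8). Increasing $d(u)$ can only violate Inequality (2.8)
for simple cycles that go through vertex $u$. From Step 3 we have

\begin{align*}
\delta(u) &= \min \left\{ l(v) + w'(e) : v \stackrel{e}{\to} u \in E \right\} \\
&= \min \left\{ \sum_{e \in c} w'(e) : u \stackrel{c}{\leadsto} u \right\} \\
&= \min \left\{ \sum_{x \stackrel{e}{\to} y \in c} \left( \lfloor w(e)/2 \rfloor \pi + (w(e) \bmod 2)(\gamma_{\chi(x)} + \phi_{1-\chi(x)}) - d(y) \right) : u \stackrel{c}{\leadsto} u \right\} \\
&= \min \left\{ (\gamma_1 + \phi_0) \sum_{e \in c} w^0(e) + (\gamma_0 + \phi_1) \sum_{e \in c} w^1(e) - \sum_{x \in c} d(x) : u \stackrel{c}{\leadsto} u \right\} \\
&= \min \left\{ (\gamma_1 + \phi_0) w(c)/2 + (\gamma_0 + \phi_1) w(c)/2 - d(c) : u \stackrel{c}{\leadsto} u \right\} \\
&= \min \left\{ (\gamma_1 + \phi_0 + \gamma_0 + \phi_1) w(c)/2 - d(c) : u \stackrel{c}{\leadsto} u \right\} ,
\end{align*}

where $w^0(e)$ and $w^1(e)$ denote the number of latches for each edge $e \in E$ that are clocked
on $\phi_0$ and $\phi_1$, respectively. From the last equality, using Equation (2.1), we infer that

\[ \delta(u) = \min \left\{ \pi w(c)/2 - d(c) : u \stackrel{c}{\leadsto} u \right\} . \]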

Therefore, this value of $\delta(u)$ is the maximum by which we can increase $d(u)$ without violating
Inequality (2.8).

AllNCSA($G$, $\pi$)

1 Compute all-pairs shortest-paths lengths $l(u,v)$ for each pair $u,v \in V$, on graph $G$
  with edge-length $w'(e) = \lfloor w(e)/2 \rfloor \pi + (w(e) \bmod 2)(\gamma_{\chi(u)} + \phi_{1-\chi(u)}) - d(v)$
  for every edge $u \stackrel{e}{\to} v \in E$.

2 for each $u \in V$

3 do $\delta(u) = \min \{ l(u,v) + w'(e) : v \stackrel{e}{\to} u \in E \}$

4 for each $u \in V$

5 do $\delta(u) = \min \{ \delta(u), \min_{v \in V} \{ \phi_{\chi(v)} - d(v) + l(v,u) \} + \min_{v \in V} \{ l(u,v) + \gamma_{\chi(v)} + \phi_{1-\chi(v)} \} \}$

6 for each $u \in V$

7 do if $\delta(u) \ge 0$

8 then return $\delta(u)$

9 else return fail

Figure 2-6: Algorithm AllNCSA, which solves the noncritical sensitivity analysis problem for
all combinational blocks in a circuit $G$. The algorithm takes as input a two-phase circuit $G$ and
a clocking scheme $\pi$. It produces as output the maximum increase $\delta(u)$ in the delay $d(u)$ of each
combinational block $u \in V$ such that the proper timing of $G$ by the clocking scheme $\pi$ is not affected.

Now, let us consider Inequalities (2.6) and (2.7). In Step 4 the algorithm computes a
value $\delta(u)$ such that

\begin{align*}
\delta(u) &\le \min \left\{ \phi_{\chi(x)} - d(x) + \sum_{e \in p} w'(e) + \sum_{e \in q} w'(e) + \gamma_{\chi(y)} + \phi_{1-\chi(y)} : x \stackrel{p}{\leadsto} u,\ u \stackrel{q}{\leadsto} y,\ x, y \in V \right\} \\
&= \min \left\{ r(r) - d(r) : x \stackrel{r}{\leadsto} y,\ u \in r, \text{ and } x, y \in V \right\} .
\end{align*}

Therefore, if we increase $d(u)$ by this value of $\delta(u)$ we do not violate any of the Inequalities
(2.6) and (2.7). Moreover, this is the maximum value of $\delta(u)$ that does not violate these
inequalities, because it satisfies the one that corresponds to the path $r$ with equality. □

Algorithm AllNCSA, shown in Figure 2-6, solves the noncritical sensitivity analysis
problem for all combinational blocks $u \in V$ in $O(VE + V^2 \lg V)$ time. This algorithm
simply solves an all-pairs shortest-paths problem on $G$ with edge-lengths chosen in a way
that accounts automatically for the path or the cycle with the minimum slack.

In the following two lemmas, we prove the correctness of Algorithm AllNCSA and a
bound on its running time.

Lemma 23 Let $G = \langle V,E,d,w,\chi \rangle$ be a two-phase, level-clocked circuit, and let $\pi =
\langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ be a clocking scheme such that $G$ is properly timed by $\pi$. Then, for each
combinational block $u \in V$, Algorithm AllNCSA correctly determines the maximum
possible increase in its propagation delay $d(u)$ such that $G$ is still properly timed by $\pi$.

Proof. For every $u \in V$, we repeat the proof of Lemma 22. □

Lemma 24 Algorithm AllNCSA terminates in $O(VE + V^2 \lg V)$ time.

Proof. The all-pairs shortest-paths computation in Step 1 can be performed in $O(VE +
V^2 \lg V)$ time using Johnson's algorithm [5]. The minimizations in Steps 2 and 4 require
$O(V^2)$ time. Therefore, Algorithm AllNCSA terminates in $O(VE + V^2 \lg V)$ time. □

CSA($G$, $\pi$, $u$)

1 if TV($G$, $\pi$) = yes

2 then return

3 $d_{init}(u) \leftarrow d(u)$

4 $d(u) \leftarrow 0$

5 if TV($G$, $\pi$) = no

6 then return "$G$ cannot be properly timed just by reducing $d(u)$"

7 else $d(u) \leftarrow$ NCSA($G$, $\pi$, $u$)

8 $\delta(u) \leftarrow d_{init}(u) - d(u)$

9 return $\delta(u)$

Figure 2-7: Algorithm CSA for the critical sensitivity analysis problem. The algorithm takes as
input a circuit $G$, a clocking scheme $\pi$, and a vertex $u$. It produces as output the minimum $\delta(u)$
such that $G$ is properly timed by $\pi$ if we decrease the propagation delay $d(u)$ by $\delta(u)$.

Note that either Algorithm NCSA or Algorithm AllNCSA can be used to detect
combinational blocks that lie on critical paths or critical cycles for the given clocking scheme
$\pi$. For each critical block $u$, the value $\delta(u)$ computed by these algorithms is simply zero.
Moreover, Algorithm AllNCSA can detect if the circuit $G$ is not properly timed by the
given clocking scheme $\pi$. If the clocking scheme $\pi$ does not satisfy the timing constraints
around some simple cycle, then the shortest-paths computation in Step 1 detects a
negative-weight cycle and the algorithm fails. If the clocking scheme $\pi$ does not satisfy the
timing constraints along some path, then $\delta(u)$ is negative, and the algorithm fails again.

2.4.2 Critical Sensitivity Analysis 

Algorithm CSA, shown in Figure 2-7, solves the critical sensitivity analysis problem
for a single combinational block $u \in V$. The algorithm returns if $G$ is properly timed
by the given clocking scheme $\pi$. Otherwise, it saves the original propagation delay $d(u)$
in the variable $d_{init}(u)$ and computes the maximum nonnegative $d(u)$ that will not affect
the proper timing of $G$ by the clocking scheme $\pi$. The difference between the original
propagation delay of $u$ and its new propagation delay gives $\delta(u)$. It is straightforward to
verify the correctness of Algorithm CSA. The algorithm terminates in $O(VE)$ steps, since
Algorithm NCSA runs in $O(VE)$ time.

2.5 Period minimization by clock tuning 

In this section, we address the tuning problem: Given a two-phase, level-clocked circuit
$G = \langle V,E,d,w,\chi \rangle$ and gaps $\gamma_0$ and $\gamma_1$, compute a clocking scheme $\pi^* = \langle \phi_0^*, \gamma_0, \phi_1^*, \gamma_1 \rangle$ with
minimum period such that $G$ is properly timed by $\pi^*$. We show that two-dimensional linear
programming can be used to solve this problem in $O(VE)$ time.

Figure 2-8: The tuning problem as a two-dimensional linear program in the $(\phi_0, \pi)$ plane. The
lines with negative slope describe constraints on paths $u \leadsto v$ with $\chi(u) = 0$ and $\chi(v) = 1$. The lines
with positive slope correspond to constraints on paths $u \leadsto v$ with $\chi(u) = 1$ and $\chi(v) = 0$. The
horizontal lines describe constraints on paths with $\chi(u) = \chi(v)$, and the lower bound on the clock
period due to the constraints on cycles. The shaded area is the set of feasible points for the linear
program. The circuit $G$ is properly timed by every clocking scheme that corresponds to a point in
the shaded area.

The basic idea behind the period minimization algorithm is to view $\pi$, $\phi_0$, and $\phi_1$ as
variables. By Equation (2.1), we have $\phi_1 = \pi - \gamma_0 - \gamma_1 - \phi_0$, and thus, the constraints
defined in Lemma 19 can be viewed as inequalities in the two variables $\pi$ and $\phi_0$. In fact,
they form a linear program in which the objective is to minimize $\pi$. This program can
be described by a set of lines in the $(\phi_0, \pi)$ plane, as shown in Figure 2-8. We distinguish
three kinds of lines corresponding to Inequalities (2.11), (2.12), and (2.8). The next three
lemmas show how to efficiently derive each of the three sets of linear constraints that form
the linear program.

Lemma 25 In $O(VE)$ time, the constraints defined by Inequality (2.11) can be reduced to
an equivalent set of $O(V)$ linear inequalities in the variables $\pi$ and $\phi_0$.

Proof. The constraints defined by Inequality (2.11) depend on both $\pi$ and $\phi_0$. We can
separate these constraints into two sets depending on the value of $\chi(v)$. If $\chi(v) = 1$, then
the inequality becomes

\[ \pi \ge \frac{D(v,i) - \phi_0}{(i+1)/2} , \tag{2.13} \]

which defines a half-plane of feasible $(\phi_0, \pi)$ points for each fixed $i$. If $\chi(v) = 0$, the
inequality becomes

\[ \pi \ge \frac{D(v,i) + \gamma_0 + \gamma_1 + \phi_0}{(i+3)/2} , \tag{2.14} \]

which once again defines a half-plane of feasible $(\phi_0, \pi)$ points for each fixed $i$, since $\gamma_0$ and
$\gamma_1$ are constants (see Figure 2-8). Each inequality holds for each $v \in V$ and for each odd $i$
in the range $1 \le i \le 3|V| - 3$. The $O(V^2)$ constraints defined by Inequality (2.11) can be
determined in $O(VE)$ time by computing the values $D(v,i)$, as in Step 2 of Algorithm TV.
These constraints can be reduced in $O(V^2)$ time to an equivalent set of $O(V)$ constraints by
selecting, for each odd $i$ in the range $1 \le i \le 3|V| - 3$, the particular constraint for which
$D(v,i)$ is maximized. The total running time is $O(VE) + O(V^2) = O(VE)$. □

Lemma 26 In $O(VE)$ time, the constraints defined by Inequality (2.12) can be reduced to
a single lower bound on the clock period $\pi$.

Proof. The constraints defined by Inequality (2.12) can be rewritten as

\[ \pi \ge \frac{D(v,i) + \gamma_{1-\chi(v)}}{(i+2)/2} , \tag{2.15} \]

for even $i$. Each of these constraints depends on $\pi$ but not on $\phi_0$ (see Figure 2-8).
Consequently, these constraints together determine a single lower bound on $\pi$ which is independent
of the duty cycle of either phase. After computing the values of $D(v,i)$ in $O(VE)$ time, this
bound on $\pi$ can be determined by simply finding the maximum of the $O(V^2)$ right-hand
sides of Inequality (2.15). □

The third of our lemmas focuses on Inequality (2.8). As in Lemma 26, we can compute
a single lower bound on the clock period $\pi$ which is independent of the duty cycles. This
lower bound can be found by solving a "tramp steamer" problem.

The tramp steamer problem (also known as the minimum cost-to-time ratio cycle
problem) was formulated in [7] as follows. Let $G = \langle V, E, s, t \rangle$ be a directed graph in which each
edge $u \stackrel{e}{\to} v \in E$ has an integer cost $s(e)$ and an integer transit time $t(e)$, such that for any
cycle $c$ in $G$, we have $\sum_{e \in c} t(e) > 0$. For any cycle $c$ in $G$, define the cost-to-time ratio of
the cycle by

\[ R(c) = \frac{\sum_{e \in c} s(e)}{\sum_{e \in c} t(e)} . \]

The problem is to find

\[ R(G) = \min \{ R(c) : c \text{ is a cycle in } G \} , \]

which is the minimum such ratio over all cycles in the graph. If $t(e) \ge 0$ for all $e \in E$, then
the algorithm from [16] can solve the tramp steamer problem in $O(TE)$ time, where

\[ T = \sum_{v \in V} \max \{ t(e) : u \stackrel{e}{\to} v \in E \} . \]

The following lemma relates the constraints determined by Inequality (2.8) to the tramp
steamer problem.

Lemma 27 In $O(VE)$ time, the constraints defined by Inequality (2.8) can be reduced to a
single lower bound on the clock period $\pi$.

Proof. Given the circuit $G = \langle V, E, d, w, \chi \rangle$, define $\hat{G} = \langle V, E, s, t \rangle$ to be the graph obtained
by assigning to each edge $u \stackrel{e}{\to} v \in E$ a cost $s(e) = -d(u)$ and a transit time $t(e) = w(e)$.
Then we claim Inequality (2.8) is satisfied if and only if

\[ \pi \ge -2R(\hat{G}) , \tag{2.16} \]

where $R(\hat{G})$ is the minimum cost-to-time ratio of any cycle in $\hat{G}$.

We first prove that Inequality (2.8) implies Inequality (2.16). Let $c$ be the cycle in
$\hat{G}$ with minimum cost-to-time ratio, that is, $R(c) = R(\hat{G})$. By Inequality (2.8), we have
$d(c) \le \pi (w(c)/2)$, and hence

\begin{align*}
\pi &\ge \frac{2 d(c)}{w(c)} \\
&= -2 \, \frac{\sum_{e \in c} s(e)}{\sum_{e \in c} t(e)} \\
&= -2R(\hat{G}) .
\end{align*}

The proof for the other direction of the claim is similar.

Using the algorithm for the tramp steamer problem given in [16], the cycle constraints
can be checked in $O(VE)$ time. In order to obtain this running time, we must guarantee
that the transit time of any path with $|V|$ edges is $O(V)$. Indeed, this requirement is met,
since from Corollary 17 the number of latches on any edge can be reduced to at most 3. □
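The lower bound of Lemma 27 equals twice the largest delay-to-latch ratio $d(c)/w(c)$ over all cycles. The sketch below approximates it numerically; it deliberately substitutes a simpler technique for the algorithm of [16], namely bisection on $\pi$ combined with a Bellman-Ford negative-cycle test on the weights $\pi w(e) - 2d(u)$. The function name, data layout, and tolerance are hypothetical, and the sketch assumes that every cycle carries at least one latch and that delays are nonnegative, so that $2\sum_v d(v)$ is a valid starting upper bound.

```python
def min_period_from_cycles(vertices, edges, d, tol=1e-9):
    """Approximate the smallest pi with d(c) <= pi * w(c) / 2 on every cycle."""

    def feasible(pi):
        # No negative cycle under weight pi*w(e) - 2*d(u) means that
        # every cycle satisfies the cycle constraint for this pi.
        dist = {v: 0.0 for v in vertices}
        for _ in range(len(vertices)):
            changed = False
            for u, v, w in edges:
                cand = dist[u] + pi * w - 2 * d[u]
                if cand < dist[v]:
                    dist[v] = cand
                    changed = True
            if not changed:
                return True
        return False

    lo, hi = 0.0, 2.0 * sum(d[v] for v in vertices)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Bisection costs one $O(VE)$ feasibility test per halving of the search interval, so it does not match the $O(VE)$ total claimed for the tramp steamer algorithm; it is shown only because it is short and easy to check.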

The following theorem combines Lemmas 25, 26, and 27 to solve the tuning problem
for two-phase circuits.

Theorem 28 The tuning problem can be solved for a two-phase, level-clocked circuit $G =
\langle V,E,d,w,\chi \rangle$ and gaps $\gamma_0$ and $\gamma_1$ in $O(VE)$ time.

Proof. By Lemma 19, any clock period $\pi$ must satisfy Inequalities (2.11), (2.12), and (2.8),
which, by Lemmas 25, 26, and 27, reduce to Inequalities (2.13), (2.14), (2.15), and (2.16),
which are linear in $\phi_0$ and $\pi$. We additionally must ensure that only valid clocking schemes
are considered, which we can do by adding the constraints $\phi_0 \ge 0$ and $\phi_1 = \pi - \gamma_0 - \gamma_1 - \phi_0 \ge
0$. Thus, all the constraints can be phrased as linear inequalities in $\phi_0$ and $\pi$, as is shown
in Figure 2-8.

By linear programming theory [41], the optimal clock period $\pi^*$ can be obtained at
a point $(\phi_0^*, \pi^*)$ corresponding to the intersection of these $O(V)$ constraints. Megiddo's
algorithm [37] can solve such a two-dimensional linear program in $O(V)$ time. Alternatively,
one can first compute the $O(V^2)$ intersections among Inequalities (2.13) and (2.14), $\phi_0 \ge 0$,
and $\pi - \gamma_0 - \gamma_1 - \phi_0 \ge 0$ (the nonhorizontal constraints). Then, let $(\phi', \pi')$ be the intersection
point of maximum $\pi$ value, and let $\pi''$ be the greatest lower bound on $\pi$ derived from the
remaining (horizontal) inequalities. The optimal period is $\pi^* = \max\{\pi', \pi''\}$, since $(\phi', \pi')$
is above every nonhorizontal line, and all linear inequalities constrain $\pi$ from below. A
feasible phase corresponding to $\pi^*$ is $\phi_0^* = \phi'$. In either case, the overall running time is
$O(VE)$. □
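Once the constraints are written in the common form $a\phi_0 + b\pi \ge c$, the two-variable program is small enough to solve by inspection of candidate corner points. The sketch below is a brute-force illustration under stated assumptions and is neither Megiddo's algorithm nor the procedure from the proof: it intersects every pair of constraint lines, keeps the intersections that satisfy all constraints, and returns the one with the smallest $\pi$. The function name `solve_tuning_lp` and the triple encoding $(a, b, c)$ are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def solve_tuning_lp(constraints):
    """Minimise pi subject to a*phi0 + b*pi >= c for each triple (a, b, c).

    Returns (phi0, pi) as exact fractions, or None if no pairwise
    intersection of constraint lines is feasible.
    """
    cons = [tuple(Fraction(x) for x in row) for row in constraints]
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines have no unique intersection
        # Cramer's rule for the 2x2 system formed by the two lines.
        phi0 = (c1 * b2 - c2 * b1) / det
        pi = (a1 * c2 - a2 * c1) / det
        if all(a * phi0 + b * pi >= c for a, b, c in cons):
            if best is None or pi < best[1]:
                best = (phi0, pi)
    return best
```

For instance, Inequality (2.13) corresponds to the triple $(1, (i+1)/2, D(v,i))$, and the validity constraints $\phi_0 \ge 0$ and $\pi - \gamma_0 - \gamma_1 - \phi_0 \ge 0$ correspond to $(1, 0, 0)$ and $(-1, 1, \gamma_0 + \gamma_1)$. With $n$ constraints the sketch performs $O(n^3)$ work, so it is suitable only for checking small instances.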

2.6 Retiming with Symmetric Clocking Schemes 

In this section, we present an $O(VE + V^2 \lg V)$-time algorithm for the retiming problem with
symmetric clocking schemes: Given a two-phase, level-clocked circuit $G = \langle V, E, d, w, \chi \rangle$ and
a symmetric clocking scheme $\pi = \langle \phi, \gamma, \phi, \gamma \rangle$, compute a retiming of $G$ which is properly
timed by $\pi$, or else determine that no such retiming exists. Like several previous retiming
algorithms [29, 52], our algorithm casts retiming for a symmetric clocking scheme as a
mixed-integer linear program.

The retiming transformation relocates the latches in a circuit $G$ without changing the
functionality of the circuit. Given an integer $r(v)$ for each vertex $v \in V$, we retime the
circuit $G$ by removing $r(v)$ latches from each output wire of $v$ and inserting $r(v)$ latches to
each input wire of $v$. The variable $r(v)$ is called the lag of vertex $v$, because it counts by how
many phases we delay the output of $v$'s computation when we retime $G$. After retiming,
we obtain a retimed circuit $G_r = \langle V,E,d,w_r,\chi_r \rangle$, where

\[ w_r(e) = w(e) + r(v) - r(u) \tag{2.17} \]

for every edge $u \stackrel{e}{\to} v \in E$, and

\[ \chi_r(v) = \begin{cases} \chi(v) & \text{if $r(v)$ is even,} \\ 1 - \chi(v) & \text{if $r(v)$ is odd,} \end{cases} \]

for every vertex $v \in V$. To ensure that $G_r$ is a well-formed two-phase circuit, we require
that all edge weights in $G_r$ are nonnegative, which is equivalent to the condition that

\[ w(e) + r(v) - r(u) \ge 0 \tag{2.18} \]

for every edge $u \stackrel{e}{\to} v \in E$. We need not check any other conditions for $G_r$ to be well formed,
since retiming does not change the weight of a cycle, which means that all cycles retain the
positive weight they had in $G$, and one can verify that Condition WF3 holds over all edges
in $G_r$.
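The retiming transformation of Equation (2.17), together with the phase update and the nonnegativity check of Inequality (2.18), can be sketched in a few lines. The function name `retime` and the representation of the circuit as edge triples and dictionaries are assumptions made for illustration only.

```python
def retime(edges, chi, r):
    """Apply lags r to a circuit given as (u, v, w) edges and phases chi.

    Returns the retimed edge list and phase assignment, or raises
    ValueError if some edge would receive a negative latch count.
    """
    new_edges = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]  # Equation (2.17)
        if w_r < 0:            # Inequality (2.18) is violated
            raise ValueError(f"edge {u}->{v} would carry {w_r} latches")
        new_edges.append((u, v, w_r))
    # A vertex keeps its phase when its lag is even and flips it otherwise.
    chi_r = {v: chi[v] if r[v] % 2 == 0 else 1 - chi[v] for v in chi}
    return new_edges, chi_r
```

On a two-vertex loop with one latch per edge, lagging one vertex by a single phase moves a latch from its output wire to its input wire and leaves the total number of latches around the cycle unchanged.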

Given a two-phase, level-clocked circuit $G = \langle V,E,d,w,\chi \rangle$ and a symmetric clocking
scheme $\pi = \langle \phi, \gamma, \phi, \gamma \rangle$, the problem of retiming with symmetric clocking schemes is to
compute a retiming function $r$ such that $G_r$ is a well-formed circuit which is properly timed
by $\pi$, or to determine that no such retiming exists. The following lemma gives a set of
necessary and sufficient constraints that such a retiming $r$ must satisfy.

Lemma 29 Let $G = \langle V, E, d, w, \chi \rangle$ be a two-phase, level-clocked circuit, let $\pi = \langle \phi, \gamma, \phi, \gamma \rangle$
be a symmetric clocking scheme, and let $r : V \to \mathbb{Z}$ be a retiming function. Then, the
retimed circuit $G_r$ is properly timed by $\pi$ if and only if

\[ r(u) - r(v) \le w(e) \tag{2.19} \]

holds for every edge $u \stackrel{e}{\to} v \in E$, and

\[ r(u) - r(v) \le \sum_{i \stackrel{e}{\to} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right) + 2 - \frac{2}{\pi} \left( d(u) + \gamma \right) \tag{2.20} \]

for every path $u \stackrel{p}{\leadsto} v$ in $G_r$.

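Proof. ($\Rightarrow$) Let $G_r$ be properly timed by $\pi$. Since all edge weights in $G_r$ are nonnegative, we
have $w_r \ge 0$, which by Equation (2.17) implies Inequality (2.19). By Lemma 14, Inequalities
(2.2) and (2.3) must also hold, but both reduce to

\[ d(p) \le \pi \left( \frac{1 + w_r(p)}{2} \right) + \frac{\pi}{2} - \gamma , \]

since the clocking scheme $\pi$ is symmetric with period $\pi = 2\phi + 2\gamma$. Let us consider a path
$u \stackrel{p}{\leadsto} v$. Using the fact that $w_r(p) = w(p) + r(v) - r(u)$, which can be proved by induction
from Equation (2.17), we obtain

\begin{align*}
d(p) &\le \pi \left( \frac{1 + w_r(p)}{2} \right) + \frac{\pi}{2} - \gamma \\
&= \pi \left( \frac{w(p) + r(v) - r(u)}{2} \right) + \pi - \gamma ,
\end{align*}

which can be rewritten as

\begin{align*}
r(u) - r(v) &\le w(p) - \frac{2}{\pi} d(p) + \left( 2 - \frac{2\gamma}{\pi} \right) \\
&= \sum_{i \stackrel{e}{\to} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right) - \frac{2}{\pi} d(u) + \left( 2 - \frac{2\gamma}{\pi} \right) \\
&= \sum_{i \stackrel{e}{\to} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right) + 2 - \frac{2}{\pi} \left( d(u) + \gamma \right) .
\end{align*}

Therefore, Inequality (2.20) holds.

($\Leftarrow$) All of the implications in the proof of the forward direction can be reversed. □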
Lemma 29 provides necessary and sufficient conditions that a retiming $r$ must satisfy
such that $G_r$ is a well-formed circuit which is properly timed by a clocking scheme $\pi$.
Unfortunately, there are an infinite number of constraints in the set specified by Inequality (2.20).
The following lemma shows that the number of constraints can be reduced to $O(V + E)$.

Lemma 30 Let $G = \langle V,E,d,w,\chi \rangle$ be a two-phase, level-clocked circuit, and let $\pi =
\langle \phi, \gamma, \phi, \gamma \rangle$ be a symmetric clocking scheme. Then there exists a retiming $r : V \to \mathbb{Z}$ of
$G$ such that $G_r$ is properly timed by $\pi$ if and only if there exists an assignment of a real
value $R(v)$ and an integer value $r(v)$ to each vertex $v \in V$ such that the following conditions
are satisfied:

\begin{align*}
r(u) - r(v) &\le w(e) && \text{for all } u \stackrel{e}{\to} v \in E , \tag{2.21} \\
R(v) - r(v) &\le 0 && \text{for all } v \in V , \tag{2.22} \\
R(u) - R(v) &\le w(e) - \frac{2}{\pi} d(v) && \text{for all } u \stackrel{e}{\to} v \in E , \tag{2.23} \\
r(u) - R(u) &\le 2 - \frac{2}{\pi} \left( d(u) + \gamma \right) && \text{for all } u \in V . \tag{2.24}
\end{align*}

Proof. ($\Leftarrow$) From Lemma 29 it suffices to show that Inequalities (2.21), (2.22), (2.23),
and (2.24) imply Inequalities (2.19) and (2.20). Inequality (2.19) is immediately satisfied,
since by Inequality (2.21) we have $r(u) - r(v) \le w(e)$ for all $u \stackrel{e}{\to} v \in E$. To prove
Inequality (2.20), consider a path $u \stackrel{p}{\leadsto} v$. Inequality (2.23) holds for every edge on $p$, and
if we sum all these inequalities, we obtain

\[ R(u) - R(v) \le \sum_{i \stackrel{e}{\to} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right) , \]

since the left-hand side telescopes. Adding this inequality to Inequalities (2.22) and (2.24)
yields Inequality (2.20).

(=>) By Lemma 29 we need only show that if r is an assignment of integers to the 
vertices in V that satisfies Inequalities (2.19) and (2.20), then we can find an assignment 
R of reals to the vertices in V such that r and R satisfy Inequalities (2.21), (2.22), (2.23), 
and (2.24). Inequality (2.18) directly implies Inequality (2.21). We construct an auxiliary 
graph to determine R and show that the remaining three inequalities are satisfied. 

Define the auxiliary graph H = (V H ,E H ,w H ) by 



V H = vu{t} 


E H = EU{u^t:ueV}U{t^u:ueV} 




r(v) for all v —> t G E H , 


rOO = < 


w(e) — ^d(v) for all u —> v G E , 




-r(u) + 2 - l{d{u) + 7) for all t -4 u G E H , 



where t is an additional vertex not in V. Define R(v) for all v G V H as the length of 
a shortest (least-weight) path in H from i; to t, which is well-defined if H contains no 
negative-weight cycles [5, Chapter 25], a fact that we shall prove shortly. 

Assuming that $H$ contains no negative-weight cycles, we can prove Inequalities (2.22),
(2.23), and (2.24) by relying on the following basic inequality of shortest paths [5, Chapter 25]:

\[ R(u) \le R(v) + w_H(e) \tag{2.25} \]

for every edge $u \xrightarrow{e} v$ in $E_H$. To prove Inequality (2.22), we consider Inequality (2.25)
with $w_H(e) = r(u)$ for all edges $u \to t$:

\begin{align*}
R(u) &\le R(t) + r(u) \\
     &\le r(u) ,
\end{align*}

since the shortest path from $t$ to itself has length $R(t) = 0$. Inequalities (2.23) and (2.24)
follow from similar reasoning by considering Inequality (2.25) with the other two classes of
edge weights.

It remains to show that $H$ contains no negative-weight cycles. Suppose, for the sake
of contradiction, that $u \stackrel{c}{\leadsto} u$ is a negative-weight cycle in $H$ that contains the
minimum number of edges among all negative-weight cycles in $H$. The cycle $c$ visits $t$ at most once,
since otherwise $c$ would contain a negative-weight subcycle with fewer edges. We consider
the cases $t \in c$ and $t \notin c$ separately.

If $t$ is a vertex in $c$, we can break the cycle $c$ into three parts:
$c = t \xrightarrow{e_1} u \stackrel{p}{\leadsto} v \xrightarrow{e_2} t$, where
$p$ does not visit $t$. Breaking the weight of $c$ into its constituent parts, we obtain

\begin{align*}
w_H(c) &= w_H(e_1) + w_H(p) + w_H(e_2) \\
  &= \left( -r(u) + 2 - \frac{2}{\pi} (d(u) + \gamma) \right)
     + \sum_{i \xrightarrow{e} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right) + r(v) \\
  &< 0 ,
\end{align*}

which, by moving $r(u)$ and $r(v)$ to the right-hand side of the strict inequality, directly
contradicts Inequality (2.20).

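If $t$ is not a vertex in $c$, we consider a path $p$ that consists of $k > 0$ repetitions of
the cycle $c$, and which starts and ends at some vertex $u$ on $c$. Since $w_H(c) < 0$, we can
make $w_H(p) = k \cdot w_H(c)$ as negative as we wish by picking $k$ sufficiently large. From
Inequality (2.20), we obtain

\begin{align*}
0 &= r(u) - r(u) \\
  &\le \sum_{i \xrightarrow{e} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right)
       + \left( 2 - \frac{2}{\pi} (d(u) + \gamma) \right) \\
  &= w_H(p) + \left( 2 - \frac{2}{\pi} (d(u) + \gamma) \right) \\
  &= k \cdot w_H(c) + \left( 2 - \frac{2}{\pi} (d(u) + \gamma) \right) \\
  &< 0 ,
\end{align*}

by picking $k > -\left( 2 - (2/\pi)(d(u) + \gamma) \right) / w_H(c)$. This contradiction completes
the proof. □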
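The auxiliary-graph construction in the proof of Lemma 30 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `auxiliary_potentials`, the dictionary-based graph encoding, and the use of a plain Bellman-Ford relaxation are choices made here, not part of the thesis. Given a candidate retiming $r$, the sketch builds $H$, computes $R(v)$ as the shortest-path distance from $v$ to $t$, and reports a negative-weight cycle (that is, a violation of Inequality (2.20)) by returning `None`.

```python
import math


def auxiliary_potentials(vertices, edges, d, r, period, gap):
    """Compute R(v), the least weight of a path from v to t in the graph H.

    edges maps (u, v) -> w(e); d maps v -> delay; r maps v -> integer lag.
    Returns a dict of R values, or None if H has a negative-weight cycle.
    All names here are illustrative assumptions, not the thesis's code.
    """
    t = object()  # fresh vertex, guaranteed not to be in V
    scale = 2.0 / period
    # Edges of E, weighted w(e) - (2/pi) d(v).
    h_edges = [(u, v, w - scale * d[v]) for (u, v), w in edges.items()]
    for u in vertices:
        h_edges.append((u, t, r[u]))                              # u -> t
        h_edges.append((t, u, -r[u] + 2 - scale * (d[u] + gap)))  # t -> u

    R = {v: math.inf for v in vertices}
    R[t] = 0.0
    # Bellman-Ford towards t: |V_H| - 1 = |V| rounds of relaxation.
    for _ in range(len(vertices)):
        for u, v, weight in h_edges:
            if R[v] + weight < R[u]:
                R[u] = R[v] + weight
    # A further possible relaxation means H has a negative-weight cycle.
    if any(R[v] + weight < R[u] - 1e-12 for u, v, weight in h_edges):
        return None
    return {v: R[v] for v in vertices}


# Toy two-vertex circuit (made-up numbers): period 10, gap 1.
example_edges = {("a", "b"): 2, ("b", "a"): 0}
example_R = auxiliary_potentials(
    ["a", "b"], example_edges, {"a": 2, "b": 3}, {"a": 0, "b": -1}, 10.0, 1.0
)
```

For the toy circuit the resulting values of $R$ coincide with $r$, so Inequalities (2.22), (2.23), and (2.24) can be checked by hand. Raising both delays to 10 makes the cycle $a \to t \to a$ negative, and the sketch then returns `None`.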
The set of constraints defined in Lemma 30 forms a mixed-integer linear programming
problem. Although mixed-integer linear programming is in general NP-hard [13], the simple
form of the constraints in the lemma allows the problem to be solved efficiently. In
particular, Inequalities (2.21), (2.22), (2.23), and (2.24) constitute a mixed-integer linear
programming problem of the following form.

Problem MI Let $H = (V, V_I, E, a)$ be an edge-weighted, directed graph, where
$V = \{1, 2, \ldots, n\}$ is the vertex set, $V_I$ (the "integer" vertices) is a subset of $V$, the edge
set $E$ is a subset of $V \times V$, and for each edge $(i, j) \in E$ the edge weight $a(i, j)$ is a real
number. Find a vector $x = (x_1, x_2, \ldots, x_n)$ satisfying the constraints that

\[ x_i - x_j \le a(i, j) \]

for all $(i, j) \in E$, and that $x_i \in \mathbb{Z}$ for all $i \in V_I$, or determine that no feasible
vector exists. □

Problem MI can be solved in $O(VE + V^2 \lg V)$ time by applying Algorithm MILP from [30].
Thus, we obtain the following theorem.

Theorem 31 The retiming problem with symmetric schemes can be solved for a two-phase,
level-clocked circuit $G = (V, E, d, w, \chi)$ and a symmetric clocking scheme
$\pi = \langle \phi, \gamma, \phi, \gamma \rangle$ in $O(VE + V^2 \lg V)$ time.

RwSCS($G$, $\pi$)

1  Generate Inequalities (2.21), (2.22), (2.23), and (2.24) from $G$ and $\pi$.
2  Apply Algorithm MILP on Inequalities (2.21), (2.22), (2.23), and (2.24) to compute a retiming $r$.
3  if all constraints are satisfied
4     then return $r$
5     else fail

Figure 2-9: Algorithm RwSCS for retiming with symmetric clocking schemes. The algorithm
takes as input a two-phase, level-clocked circuit $G = (V, E, d, w, \chi)$ and a symmetric clocking
scheme $\pi = \langle \phi, \gamma, \phi, \gamma \rangle$. It produces as output a retiming $r$ such that $G_r$
is properly timed by $\pi$, or else it determines that no such retiming is possible.

Proof. Algorithm RwSCS in Figure 2-9 solves the retiming problem with symmetric phases.
It simply applies Algorithm MILP from [30] to the constraints in Lemma 30. Since
$|V_H| = |V| + 1$ and $|E_H| = O(V + E)$, the running time of RwSCS is $O(VE + V^2 \lg V)$. □
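Line 1 of Algorithm RwSCS can be sketched in Python as follows. This is a hedged illustration only: the names `rwscs_constraints` and `satisfies`, and the encoding of each unknown as a tagged pair such as `("r", v)` or `("R", v)`, are assumptions of this sketch. It emits the difference constraints of Lemma 30 in the form of Problem MI and checks a candidate assignment; it does not implement Algorithm MILP from [30].

```python
def rwscs_constraints(vertices, edges, d, period, gap):
    """Return triples (x_i, x_j, a), each meaning x_i - x_j <= a.

    Unknowns tagged "r" are the integer vertices V_I of Problem MI;
    unknowns tagged "R" are real valued. edges maps (u, v) -> w(e).
    """
    scale = 2.0 / period
    cons = []
    for (u, v), w in edges.items():
        cons.append((("r", u), ("r", v), w))                   # (2.21)
        cons.append((("R", u), ("R", v), w - scale * d[v]))    # (2.23)
    for v in vertices:
        cons.append((("R", v), ("r", v), 0.0))                 # (2.22)
        cons.append((("r", v), ("R", v), 2 - scale * (d[v] + gap)))  # (2.24)
    return cons


def satisfies(cons, x, tol=1e-9):
    """Check a candidate assignment x (dict keyed by tagged unknowns)."""
    return all(x[i] - x[j] <= a + tol for i, j, a in cons)


# Toy two-vertex circuit (made-up numbers): period 10, gap 1.
toy_cons = rwscs_constraints(
    ["a", "b"], {("a", "b"): 2, ("b", "a"): 0}, {"a": 2, "b": 3}, 10.0, 1.0
)
good = {("r", "a"): 0, ("r", "b"): -1, ("R", "a"): 0.0, ("R", "b"): -1.0}
zero = {key: 0 for key in good}
```

The sketch produces $2|E| + 2|V|$ constraints, in line with the $O(V + E)$ bound. In the toy instance the assignment `good` satisfies all of them, whereas the all-zero assignment violates Inequality (2.23) on the edge from b to a.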



2.7 Retiming with General Clocking Schemes 

In this section, we study the retiming problem: Given a circuit $G = (V, E, d, w, \chi)$ and a
clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, compute a retiming function $r$
such that $G_r$ is a well-formed circuit which is properly timed by $\pi$, or else determine that no such
retiming exists. First, we make a short digression in order to introduce integer monotonic programming,
a problem related to integer linear programming in which both sides of the inequalities are
monotone, but not necessarily linear, functions of the unknowns. We consider a simple case
of integer monotonic programming in which each left-hand side is a function of a single
unknown and show that problems of this nature admit a unique minimum solution $\hat{x}$. We
describe a generic procedure for finding $\hat{x}$. We then cast the retiming problem in the form of
a simple integer monotonic programming problem and use the generic procedure to obtain
an $O(V^3)$-time algorithm for the retiming problem.

The integer monotonic programming problem is defined as follows. 

Definition Let $S$ be a set of constraints over the unknowns $x_1, x_2, \ldots, x_n$, in which the
$k$th constraint has the form

\[ f_k(x_1, x_2, \ldots, x_n) \ge g_k(x_1, x_2, \ldots, x_n) , \]

where the functions $f_k$ and $g_k$ are monotonically increasing with respect to each $x_j$, for
$j = 1, 2, \ldots, n$. The integer monotonic programming problem is to find a vector
$x = (x_1, x_2, \ldots, x_n)$ of integers satisfying $S$, or determine that no feasible vector exists. An
integer monotonic programming problem is simple if each $f_k$ is a function of a single unknown;
thus, the $k$th constraint has the simpler form

\[ f_k(x_i) \ge g_k(x_1, x_2, \ldots, x_n) . \]

□

Based on the monotonicity of $f_k$ and $g_k$, we can argue that if a simple integer monotonic
program has a solution over the nonnegative integers, then it has a unique "minimum
solution".

Lemma 32 Let $x = (x_1, x_2, \ldots, x_n)$ and $x' = (x'_1, x'_2, \ldots, x'_n)$ be two solutions to some
simple integer monotonic program, and let $x''_i = \min\{x_i, x'_i\}$ for all $i = 1, 2, \ldots, n$. Then,
$x'' = (x''_1, x''_2, \ldots, x''_n)$ is also a solution.

Proof. Consider the constraint $f_k(x_i) \ge g_k(x_1, x_2, \ldots, x_n)$ in the simple integer monotonic
program. Assume, without loss of generality, that $x''_i = x_i$. It follows that

\begin{align*}
f_k(x''_i) &= f_k(x_i) \\
  &\ge g_k(x_1, x_2, \ldots, x_n) \\
  &\ge g_k(x''_1, x''_2, \ldots, x''_n) ,
\end{align*}

since $g_k$ is monotonically increasing with respect to all its arguments. Therefore, $x''$ is also
a solution to the monotonic program. □

Corollary 33 For any simple integer monotonic program having a solution in which $x_i \ge 0$
for $i = 1, 2, \ldots, n$, there exists a unique minimum solution $(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)$ which
is minimum in the sense that for all other solutions $(x_1, x_2, \ldots, x_n)$, we have $\hat{x}_i \le x_i$ for
$i = 1, 2, \ldots, n$.

Proof. The proof follows from Lemma 32 and the nonnegativity of the $x_i$. □

If a simple integer monotonic programming problem has a solution over the nonnegative
integers, the relaxation procedure MonoRelax in Figure 2-10 can find the minimum
solution. After initializing the $x_i$, the procedure performs a sequence of relaxations over
the set of constraints. Each relaxation step (lines 4-5) consists of determining a constraint
$f_k(x_i) \ge g_k(x_1, x_2, \ldots, x_n)$ which is violated and then incrementing $x_i$. If there is a
solution, the running time is proportional to $\sum_{i=1}^{n} \hat{x}_i$ multiplied by the time it takes to
find a violated constraint. If there is no solution, however, the procedure runs forever. Later
in this section, we shall present a procedure based on MonoRelax to solve the retiming
problem. This new procedure always terminates, however, because for the special case of
the retiming problem we can prove that whenever the unknowns $x_i$ exceed an upper bound,
then the problem has no solution. The following theorem proves that MonoRelax finds
the minimum solution when a solution exists.

Theorem 34 Let $S$ be the set of constraints in a simple integer monotonic programming
problem over the unknowns $x_1, x_2, \ldots, x_n$, and suppose $S$ has a solution. Then, MonoRelax
finds the minimum solution $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n$.

Proof. We first show that after each iteration the invariant $\hat{x}_i \ge x_i$ holds for
$i = 1, 2, \ldots, n$. The invariant initially holds, since $x_i = 0$ for every $i$. Consider an iteration of
the procedure in which a constraint $f_k(x_i) \ge g_k(x_1, x_2, \ldots, x_n)$ is violated, which means
$f_k(x_i) < g_k(x_1, x_2, \ldots, x_n)$. Since no constraints are violated for the minimum solution
$\hat{x}$, and since $g_k$ is monotonic and $\hat{x}_j \ge x_j$ for every $j$, we have

\begin{align*}
f_k(\hat{x}_i) &\ge g_k(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n) \\
  &\ge g_k(x_1, x_2, \ldots, x_n) \\
  &> f_k(x_i) .
\end{align*}

But since $f_k$ is monotonic, $f_k(\hat{x}_i) > f_k(x_i)$ implies that $\hat{x}_i > x_i$. Thus, after
incrementing $x_i$, the invariant $\hat{x}_i \ge x_i$ continues to hold.

To complete the correctness proof, observe that when all constraints are satisfied, we
obtain a solution $x$, which, by Corollary 33 and the fact that $x_i \le \hat{x}_i$, is equal to $\hat{x}$.
Moreover, we must eventually achieve this unique minimum solution, because exactly
$\sum_{i=1}^{n} \hat{x}_i$ relaxations can occur, since each relaxation increases $\sum_{i=1}^{n} x_i$ by
exactly 1. □

MonoRelax($S$)

1  for $i \leftarrow 1$ to $n$
2     do $x_i \leftarrow 0$
3  while there exists an unsatisfied constraint in $S$
4     do pick an unsatisfied constraint $f_k(x_i) \ge g_k(x_1, x_2, \ldots, x_n)$
5        $x_i \leftarrow x_i + 1$
6  return $(x_1, x_2, \ldots, x_n)$

Figure 2-10: Procedure MonoRelax for solving a simple integer monotonic program $S$ over
unknowns $x_1, x_2, \ldots, x_n$. The procedure returns a solution if and only if the constraints in $S$
can be satisfied. Otherwise, it runs forever.
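As a rough illustration of Procedure MonoRelax, the following Python sketch represents each constraint as a triple `(i, f, g)` standing for $f(x_i) \ge g(x_1, \ldots, x_n)$. The function name `mono_relax`, this encoding, and the optional `limit` parameter (which makes the sketch give up instead of running forever on an infeasible program) are assumptions introduced here, not part of the original procedure.

```python
def mono_relax(n, constraints, limit=None):
    """Relaxation sketch for a simple integer monotonic program.

    constraints: list of (i, f, g), meaning f(x[i]) >= g(x), where f and g
    are monotonically increasing. Returns the minimum nonnegative solution,
    or None once some unknown exceeds `limit` (an assumption of this sketch;
    with limit=None an infeasible program loops forever, as in the text).
    """
    x = [0] * n
    while True:
        # Find the index of the unknown of some violated constraint.
        violated = next(
            (i for i, f, g in constraints if f(x[i]) < g(x)), None
        )
        if violated is None:
            return x
        x[violated] += 1  # one relaxation step
        if limit is not None and x[violated] > limit:
            return None


# x0 >= 2,  x1 >= x0 + 1,  2*x2 >= x0 + x1
feasible_program = [
    (0, lambda a: a, lambda x: 2),
    (1, lambda a: a, lambda x: x[0] + 1),
    (2, lambda a: 2 * a, lambda x: x[0] + x[1]),
]
# x0 >= x1 + 1 and x1 >= x0 cannot both hold
infeasible_program = [
    (0, lambda a: a, lambda x: x[1] + 1),
    (1, lambda a: a, lambda x: x[0]),
]
minimum = mono_relax(3, feasible_program)
gave_up = mono_relax(2, infeasible_program, limit=10)
```

On the first program the sketch performs $2 + 3 + 3 = 8$ relaxations, matching the count $\sum_i \hat{x}_i$ from the proof of Theorem 34.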

We now turn to the retiming problem with general two-phase clocking schemes. Recall
that given a circuit $G = (V, E, d, w, \chi)$ and a clocking scheme
$\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, we wish to compute a retiming function $r$ such that
$G_r$ is a well-formed circuit which is properly timed by $\pi$, or else determine that no such retiming
exists. When $\pi$ is not symmetric, Inequalities (2.2) and (2.3) cannot be simplified as in the
retiming problem with symmetric clocking schemes. Nevertheless, we can cast the retiming problem
as a simple integer monotonic programming problem. The following lemma gives a set of necessary
and sufficient conditions that such a retiming $r$ must satisfy.

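Lemma 35 Let $G = (V, E, d, w, \chi)$ be a two-phase, level-clocked circuit, let
$\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ be a clocking scheme, and let $r : V \to \mathbb{Z}$ be
a retiming function. Then, the retimed circuit $G_r$ is properly timed by $\pi$ if and only if for every
edge $u \xrightarrow{e} v \in E$, we have

\[ r(u) - r(v) \le w(e) , \tag{2.26} \]

and for every path $u \stackrel{p}{\leadsto} v$, we have

\begin{align*}
d(p) \le{} & \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{\chi(u)} \\
 & + \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
   + (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 & - \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
   - (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right) \tag{2.27}
\end{align*}

if $\chi(u) \ne \chi(v)$, and

\begin{align*}
d(p) \le{} & \pi \left( \frac{2 + w(p)}{2} \right) - \gamma_{1-\chi(u)} \\
 & + \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
   + (r(v) \bmod 2) \left( \gamma_{1-\chi(u)} + \phi_{\chi(u)} \right) \\
 & - \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
   - (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right) , \tag{2.28}
\end{align*}

if $\chi(u) = \chi(v)$.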

Proof. ($\Rightarrow$) Let $G_r$ be a well-formed circuit that is properly timed by $\pi$. Since $G_r$ is
well formed, all edge weights $w_r$ in $G_r$ must satisfy $w_r \ge 0$, which by Equation (2.17)
implies Inequality (2.26). Since $G_r$ is properly timed by $\pi$, Inequalities (2.2) and (2.3) from
Lemma 14 must hold. We prove that Inequalities (2.27) and (2.28) follow from Inequalities
(2.2) and (2.3).

The proof is a case analysis that depends on examining all possible assignments of
original phases to $u$ and $v$ and all possible parities of $r(u)$ and $r(v)$ for a path
$u \stackrel{p}{\leadsto} v$ in $G_r$. Let us consider, for example, a path $u \stackrel{p}{\leadsto} v$, where
$\chi(u) \ne \chi(v)$. Let $r$ be a retiming with even $r(u)$ and odd $r(v)$; the other cases are similar.
In this case, $\chi_r(u) = \chi(u)$, and $\chi_r(v) = 1 - \chi(v)$. Using Inequality (2.3) in Lemma 14 and
the fact that $w_r(p) = w(p) + r(v) - r(u)$, we obtain

\begin{align*}
d(p) &\le \pi \left( \frac{2 + w_r(p)}{2} \right) - \gamma_{1-\chi_r(u)} \\
     &= \pi \left( \frac{2 + w(p) + r(v) - r(u)}{2} \right) - \gamma_{\chi(v)} .
\end{align*}

Since $\chi(u) \ne \chi(v)$, we have
$\gamma_{\chi(v)} = \pi - \phi_{\chi(u)} - \gamma_{\chi(u)} - \phi_{\chi(v)}$ by Equation (2.1), and therefore

\begin{align*}
d(p) &\le \pi \left( \frac{2 + w(p) + r(v) - r(u)}{2} \right)
          + \phi_{\chi(u)} + \gamma_{\chi(u)} + \phi_{\chi(v)} - \pi \\
     &= \pi \left( \frac{2 + w(p) + r(v) - r(u)}{2} \right)
          + \phi_{\chi(u)} + \gamma_{\chi(u)} + \phi_{1-\chi(u)} - \pi \\
     &= \pi \left( \frac{1 + w(p) + (r(v) - 1) - r(u)}{2} \right)
          + \phi_{\chi(u)} + \gamma_{\chi(u)} + \phi_{1-\chi(u)} \\
     &= \pi \left( \frac{1 + w(p)}{2} \right)
          + \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
          - \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
          + \phi_{\chi(u)} + \gamma_{\chi(u)} + \phi_{1-\chi(u)} .
\end{align*}

Using the fact that $r(v) \bmod 2 = 1$ and $r(u) \bmod 2 = 0$, we obtain

\begin{align*}
d(p) \le{} & \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{\chi(u)} \\
 & + \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
   + (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 & - \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
   - (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right) ,
\end{align*}

thus proving Inequality (2.27).

($\Leftarrow$) All the implications in the proof of the forward direction can be reversed. □

Inequalities (2.27) and (2.28) can be intuitively understood in terms of rise-to-fall times.
As we saw in the proof of Lemma 35, Inequality (2.27) follows from Inequality (2.2), which
holds for every path $u \stackrel{p}{\leadsto} v$ in $G$ with $\chi(u) \ne \chi(v)$. The first line on the
right-hand side of Inequality (2.27) is just the initial allowance of time along $p$. The second line
gives the net allowance of time added to the path $p$ due to the latches that are shifted onto or off
$p$ through $v$. The term $\lfloor r(v)/2 \rfloor \pi$ counts the number of whole periods that are shifted
through $v$, and the other term accounts for the effect of shifting fractional periods. By
shifting onto $p$ a single latch through the last vertex $v$ in the path, we have a net allowance
of $\gamma_{\chi(u)} + \phi_{1-\chi(u)}$, since the new latch on the boundary of $p$ is clocked on
$\phi_{1-\chi(u)}$. Similarly, by shifting off $p$ a single latch through $v$, the net allowance is
$-\gamma_{1-\chi(u)} - \phi_{\chi(u)}$, since the $\phi_{\chi(u)}$-latch that used to be on the boundary of
$p$ is no longer there. Finally, the third line on the right-hand side of Inequality (2.27) gives the
net allowance of time added to the path $p$ due to the latches that are shifted onto or off of $p$
through $u$. The interpretation of Inequality (2.28) is analogous.

Although Lemma 35 provides necessary and sufficient conditions that a retiming must
satisfy, the set specified by Inequalities (2.26), (2.27), and (2.28) contains an infinite number
of constraints. The following lemma shows that of this infinite number, all but $O(V^2)$ are
redundant.

Lemma 36 Let $G = (V, E, d, w, \chi)$ be a two-phase, level-clocked circuit, let
$\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ be a clocking scheme, and let $r : V \to \mathbb{Z}$ be
a retiming function. Moreover, let $p$ be the shortest (least-weight) path from $u$ to $v$ in the graph
$G' = (V, E, w')$ with edge-weight function $w'(e) = \pi w(e)/2 - d(j)$ for each edge
$i \xrightarrow{e} j$ in $E$. Then, the retimed circuit $G_r$ is properly timed by $\pi$ if and only if for
every edge $u \xrightarrow{e} v \in E$, Inequality (2.26)

\[ r(u) - r(v) \le w(e) \]

holds, and for every pair of vertices $u, v \in V$, we have

\begin{align*}
d(p) \le{} & \pi \left( \frac{1 + w(p)}{2} \right) + \phi_{\chi(u)} \\
 & + \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
   + (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 & - \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
   - (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right) \tag{2.29}
\end{align*}

if $\chi(u) \ne \chi(v)$, and

\begin{align*}
d(p) \le{} & \pi \left( \frac{2 + w(p)}{2} \right) - \gamma_{1-\chi(u)} \\
 & + \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
   + (r(v) \bmod 2) \left( \gamma_{1-\chi(u)} + \phi_{\chi(u)} \right) \\
 & - \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
   - (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right) , \tag{2.30}
\end{align*}

if $\chi(u) = \chi(v)$.



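Proof. We prove that Inequality (2.29) holds if and only if Inequality (2.27) holds.
Inequality (2.27), which must hold for every path $u \stackrel{p}{\leadsto} v$ in $G$, can be rewritten

\begin{align*}
& \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
  + (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right)
  - \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
  - (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right)
  - \phi_{\chi(u)} - \frac{\pi}{2} \\
& \qquad \le \pi \left( \frac{w(p)}{2} \right) - d(p) \\
& \qquad = \sum_{i \xrightarrow{e} j \in p} \left( \frac{\pi w(e)}{2} - d(j) \right) - d(u) . \tag{2.31}
\end{align*}

Among all paths from $u$ to $v$, the tightest constraint is generated by the path $p$ that
minimizes the sum on the right-hand side of Inequality (2.31). This path is the shortest
(least-weight) path from $u$ to $v$ in $G'$. Therefore, Inequality (2.29) holds exactly when
Inequality (2.27) holds. The proof for Inequality (2.30) is similar. □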
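The reduction in Lemma 36 needs the least-weight path between every pair of vertices in $G'$. The Python sketch below computes these weights with the Floyd-Warshall recurrence, which is chosen here only for brevity and is not the algorithm the thesis relies on. The function name `shortest_paths_g_prime`, the integer vertex numbering, and the convention that a vertex reaches itself with weight 0 are assumptions of this illustration; it also assumes $G'$ has no negative-weight cycles.

```python
import math


def shortest_paths_g_prime(n, edges, d, period):
    """All-pairs least path weights in G' with w'(e) = period*w(e)/2 - d[j].

    Vertices are 0..n-1; edges maps (i, j) -> w(e); d[j] is the delay of j.
    Assumes G' contains no negative-weight cycle.
    """
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), w in edges.items():
        dist[i][j] = min(dist[i][j], period * w / 2 - d[j])
    # Floyd-Warshall: allow intermediate vertices 0..k one at a time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Toy circuit (made-up numbers): three vertices, period 8.
toy_dist = shortest_paths_g_prime(
    3, {(0, 1): 1, (1, 2): 1, (0, 2): 2, (2, 0): 1}, [1, 2, 3], 8.0
)
```

In the toy circuit the direct edge from vertex 0 to vertex 2 has weight $w' = 5$, but the two-edge path through vertex 1 has weight $2 + 1 = 3$, so the latter generates the tighter constraint.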

The next lemma shows that the retiming problem can be reduced to the simple integer 
monotonic programming problem. 

Lemma 37 The retiming problem for a two-phase, level-clocked circuit $G = (V, E, d, w, \chi)$
and a clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ can be reduced to the simple
integer monotonic programming problem.

Proof. The retiming problem for $G$ and $\pi$ is described by Inequalities (2.26), (2.29),
and (2.30). In order to prove the lemma, we must show that each of these inequalities
can be written in the form $f(r(v)) \ge g(r(u))$, where $f$ and $g$ are monotonic functions.
For every edge $u \xrightarrow{e} v \in E$, Inequality (2.26) can be written as

\[ r(v) + w(e) \ge r(u) , \]

which has the desired form, since both sides of the inequality are monotonic.

Let us concentrate, now, on Inequality (2.29). The analysis for Inequality (2.30) is
similar. Inequality (2.29) can be written in the form $f(r(v)) \ge g(r(u))$ by letting

\begin{align*}
f(r(v)) &= \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
  + (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right)
  + \pi \left( \frac{1 + w(p)}{2} \right) - d(p) + \phi_{\chi(u)} , \\
g(r(u)) &= \pi \left\lfloor \frac{r(u)}{2} \right\rfloor
  + (r(u) \bmod 2) \left( \phi_{\chi(u)} + \gamma_{\chi(u)} \right) .
\end{align*}

We demonstrate that $f$ is monotonically increasing with respect to its argument $r(v)$ by
showing that the difference $f(r(v) + 1) - f(r(v))$ is nonnegative for every $r(v)$. The proof for
the monotonicity of $g$ with respect to $r(u)$ is similar.

\begin{align*}
f(r(v) + 1) - f(r(v))
 &= \pi \left\lfloor \frac{r(v) + 1}{2} \right\rfloor
    + ((r(v) + 1) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 &\qquad - \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
    - (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 &\ge \pi \left\lfloor \frac{r(v) + 1}{2} \right\rfloor
    - \pi \left\lfloor \frac{r(v)}{2} \right\rfloor
    - (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 &= \pi (r(v) \bmod 2) - (r(v) \bmod 2) \left( \gamma_{\chi(u)} + \phi_{1-\chi(u)} \right) \\
 &= (r(v) \bmod 2) \left( \pi - \gamma_{\chi(u)} - \phi_{1-\chi(u)} \right) \\
 &\ge 0 ,
\end{align*}

since $\lfloor (r(v) + 1)/2 \rfloor - \lfloor r(v)/2 \rfloor = (r(v) \bmod 2)$ and
$\pi > \gamma_{\chi(u)} + \phi_{1-\chi(u)}$. □

Retime($G$, $\pi$)

1  $Q \leftarrow$ {constraints from Lemma 36 of the form "$f(r(v)) \ge g(r(u))$"}
2  for every vertex $v \in V$
3     do $r(v) \leftarrow 0$
4  while $Q \ne \emptyset$
5     do remove a constraint "$f(r(v)) \ge g(r(u))$" from $Q$
6        if $f(r(v)) < g(r(u))$
7           then repeat $r(v) \leftarrow r(v) + 1$
8                   if $r(v) > 3|V| - 1$
9                      then fail
10               until $f(r(v)) \ge g(r(u))$
11               $Q \leftarrow Q \cup$ {all constraints with $r(v)$ on the right-hand side}
12 return $r$

Figure 2-11: An algorithm which, given a two-phase circuit $G = (V, E, d, w, \chi)$ and a clocking
scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, determines a retiming of $G$ that is properly
timed by $\pi$, or determines that no such retiming is possible.
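The functions $f$ and $g$ from the proof of Lemma 37 can be written down directly and their monotonicity spot-checked numerically. The Python sketch below does this for one made-up clocking scheme; the argument names and the tuple encoding of the phases and gaps are assumptions of this illustration. Python's `//` and `%` operators implement the floor and mod used in Inequality (2.29), including for negative arguments.

```python
def f(r_v, period, phi, gamma, chi_u, w_p, d_p):
    """Left-hand side f(r(v)) of the monotonic form of Inequality (2.29)."""
    return (period * (r_v // 2)
            + (r_v % 2) * (gamma[chi_u] + phi[1 - chi_u])
            + period * (1 + w_p) / 2 - d_p + phi[chi_u])


def g(r_u, period, phi, gamma, chi_u):
    """Right-hand side g(r(u)) of the monotonic form of Inequality (2.29)."""
    return period * (r_u // 2) + (r_u % 2) * (phi[chi_u] + gamma[chi_u])


# Made-up scheme <phi0, gamma0, phi1, gamma1> = <3, 1, 2, 2>, period 8.
PHI, GAMMA, PERIOD = (3, 2), (1, 2), 8
f_values = [f(k, PERIOD, PHI, GAMMA, 0, 1, 5) for k in range(-6, 7)]
g_values = [g(k, PERIOD, PHI, GAMMA, 0) for k in range(-6, 7)]
```

Both sequences are nondecreasing, and advancing the argument by 2 increases either function by exactly one period.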

Figure 2-11 gives the pseudocode for Algorithm Retime, which solves the retiming problem
for a given two-phase, level-clocked circuit $G$ and a clocking scheme $\pi$. Algorithm Retime
operates in essentially the same way as procedure MonoRelax. The only difference
is that Algorithm Retime can detect if the retiming problem has no solution, in which case
it returns that the problem is infeasible. We now prove a bound on the running time of
Algorithm Retime, and then we shall prove its correctness.

Lemma 38 Algorithm Retime can be implemented to terminate in $O(V^3)$ time.

Proof. To compute the constraints in line 1 of the algorithm, we need to compute the
shortest paths between every pair of vertices in $G$. This computation can be performed in
$O(VE + V^2 \lg V)$ time using Johnson's algorithm for all-pairs shortest paths [5, Section 26.3].
To implement the set $Q$, we can use a FIFO queue and a flag for each constraint that
indicates whether the constraint is in the queue.

For every constraint $f(r(v)) \ge g(r(u)) \in Q$, the functions $f(r(v))$ and $g(r(u))$ can
be computed in $O(1)$ time, and since we have $|V|$ variables, the body of the while loop
(lines 4-11) can be completed in $O(V)$ time. Since the repeat loop in line 7 always
increments some previously incremented variable $r(v)$, and since no variable $r(v)$ becomes
greater than $3|V|$, line 7 is executed $O(V^2)$ times. Therefore, the total running time of
Algorithm Retime is $O(V^3)$. □
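The queue-and-flag implementation described in the proof of Lemma 38 can be sketched in Python as follows. This is an illustration only: the name `retime_worklist`, the encoding of a constraint as a tuple `(v, u, f, g)` standing for $f(r(v)) \ge g(r(u))$, and the explicit `bound` argument (playing the role of $3|V| - 1$) are assumptions made here. The constraints are supplied by the caller rather than derived from a circuit.

```python
from collections import deque


def retime_worklist(vertices, constraints, bound):
    """Worklist sketch of Algorithm Retime over abstract constraints.

    constraints: list of (v, u, f, g), meaning f(r[v]) >= g(r[u]) with
    f and g monotonically increasing. Returns the minimum nonnegative
    assignment r, or None if some r[v] would exceed `bound`.
    """
    r = {v: 0 for v in vertices}
    # Index the constraints by the variable on their right-hand side.
    by_rhs = {v: [] for v in vertices}
    for k, (_, u, _, _) in enumerate(constraints):
        by_rhs[u].append(k)
    queue = deque(range(len(constraints)))
    queued = [True] * len(constraints)  # one flag per constraint
    while queue:
        k = queue.popleft()
        queued[k] = False
        v, u, f, g = constraints[k]
        if f(r[v]) < g(r[u]):
            while f(r[v]) < g(r[u]):
                r[v] += 1
                if r[v] > bound:
                    return None  # infeasible
            # r[v] grew, so constraints with r[v] on the right may break.
            for j in by_rhs[v]:
                if not queued[j]:
                    queue.append(j)
                    queued[j] = True
    return r


# r(b) >= 2;  r(b) + 1 >= r(a);  r(a) + 0 >= r(b)
toy = [
    ("b", "b", lambda x: x, lambda x: 2),
    ("b", "a", lambda x: x + 1, lambda x: x),
    ("a", "b", lambda x: x, lambda x: x),
]
# r(a) >= r(b) + 1 and r(b) >= r(a) have no common solution
contradictory = [
    ("a", "b", lambda x: x, lambda x: x + 1),
    ("b", "a", lambda x: x, lambda x: x),
]
toy_result = retime_worklist(["a", "b"], toy, bound=5)
no_result = retime_worklist(["a", "b"], contradictory, bound=5)
```

The per-constraint flag keeps each constraint in the queue at most once at any time, which is what keeps the queue length bounded by the number of constraints.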

In order to show that Algorithm Retime terminates with the right answer, it suffices
to prove that if the retiming problem is feasible, then there exists a minimum solution $\hat{r}$ in
which $\hat{r}(v) \le 3|V| - 1$ for all $v \in V$, and that Algorithm Retime computes that solution.
In the following two lemmas, we prove that these conditions are met.

Lemma 39 Suppose that the retiming problem defined by Inequalities (2.26), (2.27), and (2.28)
has a solution. Then there exists a solution $r$ such that $0 \le r(v) \le 3|V| - 1$ for all $v \in V$.

Proof. We first show that under the conditions of the lemma, there exists a nonnegative
solution. Let $r$ be a solution that satisfies Inequalities (2.26), (2.27), and (2.28). Then, for
any integer $c$, the function $r'(v) = r(v) + 2c$ for all $v \in V$ also satisfies these constraints,
which can be seen by direct substitution into the three inequalities. Thus, if there is a
solution to the retiming problem, by picking $c$ large enough, we can find a nonnegative one.

Since there exists a nonnegative solution, there exists a nonnegative solution $\hat{r}$ which is
minimum in the sense that for any other nonnegative solution $r$, we have $\hat{r}(v) \le r(v)$ for
all $v \in V$. We shall show that $\hat{r}(v) \le 3|V| - 1$ for all $v \in V$.

Assume now, for the purpose of contradiction, that there exists a vertex $t \in V$ such
that $\hat{r}(t) \ge 3|V|$. Under this assumption, we shall show that there exists a solution $r'$ to
Inequalities (2.26), (2.27), and (2.28) which is smaller than $\hat{r}$, thereby contradicting the
minimality of $\hat{r}$.

Define the set $V_t \subseteq V$ to include $t$ and all vertices that can reach $t$ in $G_{\hat{r}}$ using only
those edges $u \xrightarrow{e} v$ for which $\hat{r}(v) - \hat{r}(u) \le 3$. Let $p$ be an arbitrary simple path
from a vertex $x \in V_t$ to $t$. Summing the inequalities $\hat{r}(v) - \hat{r}(u) \le 3$ for every edge
$u \xrightarrow{e} v$ along $p$, we obtain

\begin{align*}
\hat{r}(t) - \hat{r}(x) &\le \sum_{u \to v \in p} 3 \\
  &\le 3 (|V| - 1) ,
\end{align*}

since the sum telescopes and a simple path has at most $|V| - 1$ edges. By assumption,
$\hat{r}(t) \ge 3|V|$, and hence, $\hat{r}(x) \ge 3$. Thus, all vertices in $V_t$ are retimed by at least 3 in
the minimal retiming $\hat{r}$.

Let us define a new function $r'$ by the equation

\[ r'(v) = \begin{cases}
  \hat{r}(v) & \text{if } v \in V - V_t , \\
  \hat{r}(v) - 2 & \text{if } v \in V_t .
\end{cases} \tag{2.32} \]

The function $r'$ is nonnegative, because $\hat{r}(v) \ge 0$ for all $v \in V - V_t$ and $\hat{r}(v) \ge 3$ for
all $v \in V_t$. Moreover, $r'$ is nowhere larger than $\hat{r}$, and it is strictly smaller because
$t \in V_t$ implies $r'(t) < \hat{r}(t)$. Thus, if we can show that $r'$ satisfies Inequalities (2.26), (2.29),
and (2.30), we obtain the contradiction we desire.

Let us compare the circuits $G_{r'}$ and $G_{\hat{r}}$. For an edge $u \xrightarrow{e} v$, we have three
possible situations:

• $w_{r'}(e) = w_{\hat{r}}(e)$ if $u, v \in V_t$ or $u, v \in V - V_t$;

• $w_{r'}(e) = w_{\hat{r}}(e) + 2$ if $u \in V_t$ and $v \in V - V_t$;

• $w_{r'}(e) = w_{\hat{r}}(e) - 2$ if $u \in V - V_t$ and $v \in V_t$.

Observe that in the third case, $w_{r'}(e) \ge 0$, since the definition of $V_t$ implies that
$\hat{r}(v) - \hat{r}(u) \ge 4$, and thus,

\begin{align*}
w_{\hat{r}}(e) &= w(e) + \hat{r}(v) - \hat{r}(u) \\
  &\ge w(e) + 4 \\
  &\ge 4 .
\end{align*}

Thus, $r'$ is a legal retiming.

We wish to show that if $G_{\hat{r}}$ is properly timed by a clocking scheme $\pi$, then so is $G_{r'}$. The
circuit $G_{r'}$ differs from $G_{\hat{r}}$ by the addition of 2 latches on some edges and the subtraction
of 2 latches on some other edges that have at least 4 latches on them in $G_{\hat{r}}$. By Lemma 16,
the 2-latch subtraction does not affect proper timing. Adding a pair of latches on an edge
does not affect Condition WF3 (clock phases alternate along paths), and by Lemma 14,
increasing the number of latches on a path cannot violate proper timing. Thus, $G_{r'}$ is
properly timed by $\pi$.

Thus, by Lemma 35, Inequalities (2.26), (2.29), and (2.30) hold for $r'$, which contradicts
the minimality of $\hat{r}$ and completes the proof. □

We conclude this section with the following theorem. 

Theorem 40 The retiming problem can be solved for a two-phase, level-clocked circuit
$G = (V, E, d, w, \chi)$ and a clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ in
$O(V^3)$ time.

Proof. From the invariant in the proof of Theorem 34 we have that $r(v) \le \hat{r}(v)$ for all
$v \in V$, at every point during the execution of Algorithm Retime. Since $\hat{r}(v) \le 3|V| - 1$,
according to Lemma 39, we conclude that whenever a variable $r(v)$ exceeds $3|V| - 1$ during
Algorithm Retime, the monotonic program must be infeasible. It follows that Algorithm
Retime computes the minimum retiming $\hat{r}$ or correctly discerns that no solution exists.
The running time of Algorithm Retime follows from Lemma 38. □

2.8 Retiming for Minimum Latch Count 

In this section we consider the retiming problem for minimum latch count: Given a two-phase
circuit $G = (V, E, d, w, \chi)$ and a symmetric clocking scheme
$\pi = \langle \phi, \gamma, \phi, \gamma \rangle$, we wish to compute a retimed circuit $G_r$ that is properly
timed by $\pi$ and uses the minimum number of latches. We show that this problem can be solved in
$O(V^3 \lg V)$ time by reducing it to the dual of an uncapacitated minimum-cost flow problem.

The following lemma gives necessary and sufficient conditions for a retiming $G_r$ to have
the minimum number of latches.

Lemma 41 Let $G = (V, E, d, w)$ be a two-phase circuit, let
$\pi = \langle \phi, \gamma, \phi, \gamma \rangle$ be a symmetric clocking scheme, and let $r : V \to \mathbb{Z}$ be a
retiming function. Then, the retimed circuit $G_r$ achieves $\pi$ with the minimum number of latches
over all retimings if and only if the assignment $r$ minimizes the expression

\[ \sum_{v \in V} \left( \mathrm{indegree}(v) - \mathrm{outdegree}(v) \right) r(v) \]

subject to constraint (2.19)

\[ r(u) - r(v) \le w(e) \]

for every edge $u \xrightarrow{e} v \in E$, and constraint (2.20)

\[ r(u) - r(v) \le \sum_{i \xrightarrow{e} j \in p} \left( w(e) - \frac{2}{\pi} d(j) \right)
   + \left( 2 - \frac{2}{\pi} (d(u) + \gamma) \right) \]

for every path $u \stackrel{p}{\leadsto} v$ in $G$.

Proof. According to Lemma 29, the circuit $G_r$ is properly timed by $\pi$ if and only if the
two sets of constraints in the statement of the lemma are satisfied. Moreover, the number
of latches in $G_r$ is given by the expression

\begin{align*}
\sum_{e \in E} w_r(e) &= \sum_{e \in E} w(e) + \sum_{u \xrightarrow{e} v \in E} \left( r(v) - r(u) \right) \\
  &= \sum_{e \in E} w(e)
     + \sum_{v \in V} \left( \mathrm{indegree}(v) - \mathrm{outdegree}(v) \right) r(v) .
\end{align*}

Therefore, the lemma holds. □
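The counting identity used in the proof of Lemma 41 is easy to check on a small example. The Python sketch below computes the latch count of $G_r$ both edge by edge and through the degree-weighted sum; the function names and the dictionary encoding of the circuit (which does not allow parallel edges) are assumptions of this illustration.

```python
def latch_count_direct(edges, r):
    """Sum of retimed edge weights w_r(e) = w(e) + r(v) - r(u)."""
    return sum(w + r[v] - r[u] for (u, v), w in edges.items())


def latch_count_by_degree(vertices, edges, r):
    """Same count via sum of w(e) plus (indegree - outdegree) * r(v)."""
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for (u, v) in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return sum(edges.values()) + sum(
        (indeg[v] - outdeg[v]) * r[v] for v in vertices
    )


# Toy circuit (made-up numbers) with 4 latches before retiming.
toy_vertices = ["a", "b", "c"]
toy_edges = {("a", "b"): 1, ("a", "c"): 0, ("b", "c"): 2, ("c", "a"): 1}
toy_r = {"a": 0, "b": 1, "c": 1}
direct = latch_count_direct(toy_edges, toy_r)
by_degree = latch_count_by_degree(toy_vertices, toy_edges, toy_r)
```

Because $\sum_{e \in E} w(e)$ does not depend on $r$, minimizing the latch count is the same as minimizing the degree-weighted sum alone, which is the objective stated in the lemma.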

We can now prove the following theorem. 

Theorem 42 The retiming problem for minimum latch count can be solved in $O(V^3 \lg V)$
time.

Proof. The retiming problem for minimum latch count is reduced to the dual of an
uncapacitated minimum-cost flow problem [41] on the graph defined by Inequalities (2.19)
and (2.20). The cost of each edge equals the right-hand side of the corresponding inequality.
The demand/supply of each vertex $v$ equals the difference between the number of edges
coming into $v$ and the number of edges coming out of $v$. To solve this problem, we can use a
scaling algorithm by Orlin that runs in $O(V^3 \lg U)$ steps, where $U$ is the maximum
demand/supply. In our retiming problem, we have $U \le |V|$, and therefore, the algorithm
runs in $O(V^3 \lg V)$ time. □

We can give polynomial-time algorithms for several other retiming problems for minimum
latch count. If every fanout wire of each gate in the circuit has the same output, then
it can be shown that retiming for minimum number of latches with maximal latch-sharing
can be solved in $O(V^3)$ time [50]. This algorithm is more efficient than the algorithm for
the general problem, because it solves a minimum-cost flow problem on a graph with unit
demands and supplies. If the objective is to achieve the minimum clock period with the
minimum number of latches, we can find a retiming in $O(V^2 E + V^3 \lg V)$ time. First, we
compute a set of $O(V^3)$ possible clock periods using the $O(V^2 E)$-time Algorithm PiFDR,
and then we binary search this set for the minimum feasible clock period.

2.9 Approximation Schemes for Minimum-Period Retiming 

In this section we present "fully polynomial-time approximation schemes" for three problems
related to both retiming and tuning. A fully polynomial-time approximation scheme [5] is an
optimization algorithm that takes, in addition to its other input parameters, a parameter
$\epsilon > 0$ specifying a relative error. The algorithm must produce an answer that is within a
factor of $(1 + \epsilon)$ of the optimal answer for the problem and must run in time polynomial in the
input and in $1/\epsilon$.

The first problem we consider is retiming and fixed-duty-ratio tuning: Given a two-phase,
level-clocked circuit $G = (V, E, d, w, \chi)$, a real number $\rho > 0$, and gaps $\gamma_0$ and $\gamma_1$, we
wish to compute a retiming function $r$ and a clocking scheme
$\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, such that $G_r$ is properly timed by $\pi$, and $\pi$ has the
minimum period among all clocking schemes of duty ratio $\rho$ that can be achieved by retiming the
circuit $G$. We give an $O(V^3 \lg(V/\epsilon))$-time algorithm that, for any given relative error
$\epsilon > 0$, computes a retiming $r$ and a clocking scheme $\pi$ with duty ratio $\rho$, such that $G_r$ is
properly timed by $\pi$, and the period of $\pi$ is at most $(1 + \epsilon)$ times the optimal period. The
same algorithm can be used to solve the retiming and fixed-duty-cycle tuning problem, in which
instead of the duty ratio $\rho$ we are given one of the phases $\phi_0$ or $\phi_1$.

The second problem we consider is retiming and symmetric tuning: Given a two-phase,
level-clocked circuit $G = (V, E, d, w, \chi)$ and gap $\gamma$, compute a retiming function $r$ and a
symmetric clocking scheme $\pi = \langle \phi, \gamma, \phi, \gamma \rangle$, such that $G_r$ is properly timed by
$\pi$, and $\pi$ has the minimum period among all symmetric clocking schemes with gap $\gamma$ that can be
achieved by retiming $G$. This problem is a special case of the retiming and fixed-duty-ratio tuning
problem, and the same basic algorithm can be applied to yield a retiming $r$ and a clocking
scheme $\pi$ with period at most $(1 + \epsilon)$ times the optimal period, for any given relative error
$\epsilon > 0$. In this case, the algorithm terminates in $O((VE + V^2 \lg V) \lg(V/\epsilon))$ steps.

Finally, we consider the general retiming and tuning problem: Given a two-phase,
level-clocked circuit $G = (V, E, d, w, \chi)$ and gaps $\gamma_0$ and $\gamma_1$, we compute a retiming
function $r$ and a clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$, such that $G_r$ is properly
timed by $\pi$, and $\pi$ has the minimum period among all clocking schemes that can be achieved by
retiming $G$. We give an $O(V^3 (1/\epsilon) \lg(1/\epsilon) + (VE + V^2 \lg V) \lg(V/\epsilon))$-time
algorithm that, for any given relative error $\epsilon > 0$, computes a retiming $r$ and a clocking scheme
$\pi$ with period at most $(1 + \epsilon)$ times the optimal period.

2.9.1 Retiming and Fixed Duty-Ratio Tuning 

Algorithm R&FDRT, given in Figure 2-12, approximately solves the retiming and fixed
duty-ratio tuning problem using a binary search over a space of possible clock periods. Since
the problem requires that only fixed duty-ratio clocking schemes be considered, each possible
clock period corresponds to a unique clocking scheme. The algorithm checks whether this
clocking scheme is achievable using Algorithm Retime from Section 2.7. Binary search can
be employed because if a given clock period is achievable, every greater clock period is also
achievable.

R&FDRT($G, \rho, \gamma_0, \gamma_1, \epsilon$)

 1  $d_{\max} \leftarrow \max_{v \in V} d(v)$
 2  $\pi^+ \leftarrow |V| d_{\max} + \gamma_0 + \gamma_1$
 3  $\phi_0^+ \leftarrow (\pi^+ - \gamma_0 - \gamma_1)/(1+\rho)$
 4  $\pi^- \leftarrow \gamma_0 + \gamma_1$
 5  while $\pi^+ - \pi^- > \epsilon d_{\max}$
 6      do $\pi \leftarrow (\pi^+ + \pi^-)/2$
 7         $\phi_0 \leftarrow (\pi - \gamma_0 - \gamma_1)/(1+\rho)$
 8         if Retime($G, \langle\phi_0, \gamma_0, \rho\phi_0, \gamma_1\rangle$) $\ne$ fail
 9            then $\pi^+ \leftarrow \pi$
10                 $\phi_0^+ \leftarrow \phi_0$
11            else $\pi^- \leftarrow \pi$
12  $r \leftarrow$ Retime($G, \langle\phi_0^+, \gamma_0, \rho\phi_0^+, \gamma_1\rangle$)
13  return $\pi^+$ and $r$

Figure 2-12: Algorithm R&FDRT, which solves the retiming and fixed duty-ratio tuning problem.
The algorithm takes as input a two-phase circuit $G = \langle V,E,d,w,\chi\rangle$, a duty ratio $\rho$, gap widths $\gamma_0, \gamma_1$,
and a relative error $\epsilon > 0$. It computes a retimed circuit $G_r$ and a period $\pi$ such that $G_r$ is properly
timed by a clocking scheme whose period is $\pi$, whose gap widths are $\gamma_0, \gamma_1$, and whose duty ratio is
$\rho$. The period $\pi$ is guaranteed to be at most $(1+\epsilon)$ times the period of any clocking scheme that
can be achieved by $G$ under retiming and whose duty ratio is $\rho$.
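The bisection at the core of Algorithm R&FDRT can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the callable `feasible` is a hypothetical stand-in for the call to Algorithm Retime, and the names `approx_min_period`, `lower`, `upper`, and `tolerance` are assumptions introduced here. The sketch relies on the monotonicity property stated above and on the upper bound being achievable.

```python
from typing import Callable, Tuple


def approx_min_period(
    feasible: Callable[[float], bool],
    lower: float,
    upper: float,
    tolerance: float,
) -> Tuple[float, int]:
    """Bisect [lower, upper] for a near-minimal feasible period.

    Assumes feasibility is monotone (every period above a feasible one is
    feasible) and that `upper` itself is feasible. `feasible` is a
    hypothetical oracle standing in for Algorithm Retime.
    """
    pi_minus, pi_plus = lower, upper
    calls = 0
    while pi_plus - pi_minus > tolerance:
        pi = (pi_plus + pi_minus) / 2
        calls += 1
        if feasible(pi):
            pi_plus = pi   # keep the best feasible period seen so far
        else:
            pi_minus = pi  # the optimum lies above pi
    return pi_plus, calls


# Toy oracle: pretend every period of at least 7.3 is achievable.
period, calls = approx_min_period(lambda p: p >= 7.3, 2.0, 42.0, 0.5)
```

In the toy run the returned period is within the tolerance of the true threshold, and the number of oracle calls grows only logarithmically with the ratio of the initial range to the tolerance.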

Algorithm R&FDRT binary searches over a range of clock periods that is guaranteed
to include the minimum clock period $\pi^*$. As a lower bound on $\pi^*$, the algorithm uses
$\gamma_0 + \gamma_1$, which follows from Equation (2.1). As an upper bound, the algorithm uses the
value $|V| d_{\max} + \gamma_0 + \gamma_1$, where $d_{\max}$ is the maximum delay of any individual functional
element. To see that a clock period of $|V| d_{\max} + \gamma_0 + \gamma_1$ is always achievable for any
duty ratio, we consider the inequalities in Lemma 18. For each inequality, we choose the
largest left-hand side (at most $|V| d_{\max}$) and the smallest right-hand side ($w(p) = 1$ in
Inequality (2.6), $w(p) = 0$ in Inequality (2.7), and $w(c) = 2$ in Inequality (2.8)). With
the value $\pi = |V| d_{\max} + \gamma_0 + \gamma_1$, the inequalities are all satisfied.

It remains to show that the search terminates when it has isolated a sufficiently accurate
approximation $\pi$ of the true minimum $\pi^*$. The algorithm maintains that the approximation $\pi$
and the optimal clock period $\pi^*$ both fall within an interval $[\pi^-, \pi^+]$. The search terminates
when $\pi^+ - \pi^- \le \epsilon d_{\max}$, at which point we have

$$\pi \le \pi^- + \epsilon d_{\max} \le \pi^* + \epsilon d_{\max} \le \pi^* + \epsilon\pi^* = (1+\epsilon)\pi^* .$$






R&ST($G, \gamma, \epsilon$)

 1  $d_{\max} \leftarrow \max_{v \in V} d(v)$
 2  $\pi^+ \leftarrow |V| d_{\max} + 2\gamma$
 3  $\phi_0^+ \leftarrow \pi^+/2 - \gamma$
 4  $\pi^- \leftarrow 2\gamma$
 5  while $\pi^+ - \pi^- > \epsilon d_{\max}$
 6      do $\pi \leftarrow (\pi^+ + \pi^-)/2$
 7         $\phi_0 \leftarrow \pi/2 - \gamma$
 8         if RwSCS($G, \langle\phi_0, \gamma, \phi_0, \gamma\rangle$) $\ne$ fail
 9            then $\pi^+ \leftarrow \pi$
10                 $\phi_0^+ \leftarrow \phi_0$
11            else $\pi^- \leftarrow \pi$
12  $r \leftarrow$ RwSCS($G, \langle\phi_0^+, \gamma, \phi_0^+, \gamma\rangle$)
13  return $\pi^+$ and $r$

Figure 2-13: Algorithm R&ST, which solves the retiming and symmetric tuning problem. The
algorithm takes as input a two-phase circuit $G = \langle V,E,d,w,\chi\rangle$, a gap width $\gamma$, and a relative
error $\epsilon > 0$. It computes a retimed circuit $G_r$ and a period $\pi$ such that $G_r$ is properly timed by a
symmetric clocking scheme whose period is $\pi$ and whose gap widths are both $\gamma$. The period $\pi$ is
guaranteed to be at most $(1+\epsilon)$ times the period of any symmetric clocking scheme that can be
achieved by retiming $G$.

Algorithm R&FDRT runs in $O(V^3 \lg(V/\epsilon))$ time, as can be seen by the following analysis.
Lines 1-4 require only $O(V)$ time. Each execution of the while loop (lines 5-11)
is dominated by the $O(V^3)$-time call to Retime. The while loop is executed $O(\lg(V/\epsilon))$
times, since the range between the initial upper and lower bounds on $\pi$ is $|V| d_{\max}$, and we
continue the search until the range is $\epsilon d_{\max}$, dividing the range by 2 in every iteration.
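As a quick numerical check of the iteration count, the following Python sketch counts how many halvings shrink a range of width $|V| d_{\max}$ to $\epsilon d_{\max}$ and compares the count with $\lceil \lg(|V|/\epsilon) \rceil$. The function name `bisection_iterations` and the sample values are assumptions chosen for illustration; $d_{\max}$ cancels out of the ratio, so it is omitted.

```python
import math


def bisection_iterations(num_vertices: int, epsilon: float) -> int:
    """Count halvings needed to shrink a range |V|*d_max down to eps*d_max."""
    width, target = float(num_vertices), epsilon  # d_max cancels out
    iterations = 0
    while width > target:
        width /= 2
        iterations += 1
    return iterations


# Example: 1000 vertices and a relative error of 1%.
n = bisection_iterations(1000, 0.01)
bound = math.ceil(math.log2(1000 / 0.01))
```

For these sample values both quantities agree, which matches the logarithmic bound used in the analysis.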

Algorithm R&FDRT can be adjusted to find an exact solution when the propagation
delays of the combinational elements are integers. Specifically, the retiming and fixed-duty-ratio
tuning problem can be solved exactly in $O(V^3 \lg(V d_{\max}/\delta))$ time, where

7o 7i 

mm - 



3|V| - 1' 2(3 |V| - l) 2 ' 2(3 |V| - 1) 

Using the integrality of the delays, it is straightforward to show that the clock periods of
any two clocking schemes defined by Inequalities (2.11) and (2.12) must differ by at least $\delta$.
The proof relies on the fact that the optimal clocking scheme is defined by the intersection
of Equation (2.1) with one of the lines defined by Inequalities (2.11), (2.12), and (2.8).

2.9.2 Retiming and Symmetric Tuning 

The retiming and symmetric tuning problem is a special case of the retiming and
fixed duty-ratio problem, and therefore, it can be solved with Algorithm R&FDRT in
$O(V^3 \lg(V/\epsilon))$ time. We can solve this special case in $O((VE + V^2 \lg V)\lg(V/\epsilon))$ time using
Algorithm R&ST shown in Figure 2-13.

The new algorithm R&ST($G, \gamma, \epsilon$) takes as arguments the circuit $G$, a value $\gamma$ for the
gaps between the two phases, and a relative error $\epsilon$. It is essentially identical to Algorithm
R&FDRT, except that the calls to Algorithm Retime are replaced by calls to Algorithm
RwSCS. Since retiming with symmetric clocking schemes can be accomplished in $O(VE + V^2 \lg V)$
time, the same analysis as for Algorithm R&FDRT yields a running time of
$O((VE + V^2 \lg V)\lg(V/\epsilon))$ for Algorithm R&ST.

Algorithm R&ST can be adjusted to find an exact solution when the propagation delays
of the combinational elements are integers. Specifically, the retiming and symmetric tuning
problem can be solved exactly in $O((VE + V^2 \lg V)\lg(V d_{\max}/\delta))$ time, where

7o 7i 

mm • 



3|V| - 1 2(3 |V| - l) 2 2(3 |V| - 1) 

The proof relies on the integrality of the delays and the fact that the optimal clocking scheme
is given by the intersection of Equation (2.1) with one of the lines defined by Inequalities
(2.11), (2.12), and (2.8).

2.9.3 Retiming and Tuning 

We now describe a polynomial-time approximation scheme for the general retiming and
tuning problem. Algorithm GR&T, shown in Figure 2-14, computes a retimed circuit $G_r$
and a clocking scheme $\pi$ such that $G_r$ is properly timed by $\pi$ and the period of $\pi$ is at most
$(1+\epsilon)$ times the minimum clock period $\pi^*$ that can be achieved by any retiming $G_r$. The
algorithm runs in $O(V^3(1/\epsilon)\lg(1/\epsilon) + (VE + V^2 \lg V)\lg(V/\epsilon))$ time.

Algorithm GR&T proceeds as follows. In the beginning, it employs Algorithm R&ST
to compute a symmetric clocking scheme $\pi^+$ whose period is an upper bound on $\pi^*$. Specifically,
$\pi^+$ satisfies $\pi_s^* \le \pi^+ \le (1+\epsilon)\pi_s^*$, where $\pi_s^*$ is the period of the shortest symmetric
clocking scheme with gaps $\gamma = \max\{\gamma_0, \gamma_1\}$ that can be achieved by a retimed circuit $G_r$.
Subsequently, Algorithm GR&T computes a lower bound $\pi^-$ on the optimal clock period $\pi^*$.
The bounds $\pi^+$ and $\pi^-$ are then updated in the body of the outer while loop
(lines 7-22), as long as they differ by more than the constant $\delta$ computed in line 6. Each
time through the loop, the algorithm searches for a retiming $r$ such that $G_r$ is properly
timed by a clocking scheme of period $\pi = (\pi^+ + \pi^-)/2$. This search is conducted in the
inner while loop (lines 11-19), which considers clocking schemes of period $\pi$ with $\phi_0$ assuming
values in increments of $3\delta$ from the interval $(0, \pi - \gamma_0 - \gamma_1)$. If a retimed circuit $G_r$
is properly timed by one of these clocking schemes, then the upper bound $\pi^+$ assumes the
value $\pi$ (lines 16-19). If no retiming of period $\pi$ exists among the sampled values for $\phi_0$,
however, then the lower bound $\pi^-$ is increased to $\pi$ (lines 20-22).
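The interplay of the outer bisection and the inner linear scan can be sketched in Python as follows. This is a hedged illustration rather than the thesis code: `feasible(phi0, pi)` is a hypothetical oracle standing in for Algorithm Retime, `toy_oracle` is an invented feasible region shaped like the cone of Figure 2-15, and the initial bounds are supplied by the caller instead of being obtained from Algorithm R&ST. A small guard against floating-point stalls is added that the pseudocode does not need.

```python
from typing import Callable, Optional, Tuple


def gr_and_t_search(
    feasible: Callable[[float, float], bool],
    pi_minus: float,
    pi_plus: float,
    gap_sum: float,
    epsilon: float,
) -> Tuple[float, Optional[float]]:
    """Outer bisection on the period, inner linear scan over phi0.

    Returns the best feasible period found and the phi0 that achieved it
    (None if no scan succeeded and the initial upper bound is returned).
    """
    delta = pi_minus * epsilon / 2  # fixed resolution, as in line 6
    best_phi0 = None
    while pi_plus - pi_minus > delta:
        pi = (pi_plus + pi_minus) / 2
        span = pi - gap_sum  # phi0 ranges over (0, span)
        phi0 = min(2 * delta, span / 2) - 3 * delta
        found = False
        while span - phi0 > delta and not found:
            nxt = min(phi0 + 3 * delta, span - delta)
            if nxt <= phi0:  # guard against floating-point stalls
                break
            phi0 = nxt
            if feasible(phi0, pi):
                found = True
                pi_plus, best_phi0 = pi, phi0
        if not found:
            pi_minus = pi
    return pi_plus, best_phi0


def toy_oracle(phi0: float, pi: float) -> bool:
    """Invented feasible region: a cone with apex (phi0, pi) = (3, 10)."""
    if not 0 < phi0 < pi - 2.0:
        return False
    return pi - 10.0 >= max(3.0 - phi0, (phi0 - 3.0) / 2)


period, phi0 = gr_and_t_search(toy_oracle, 8.0, 16.0, 2.0, 0.1)
```

In the toy instance the true optimum is 10, and the returned period stays within the relative error of 0.1 even though the scan never samples the apex itself.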

The linear search for $\phi_0$ in increments of $3\delta$ (lines 11-19) is a key part of Algorithm
GR&T. The following lemma, which is illustrated in Figure 2-15, is used in the correctness
proof of Algorithm GR&T. The lemma considers a feasible clocking scheme for a circuit.
If the clock period is lengthened by an amount $\Delta$, the lemma shows that there is a range
of width $3\Delta$ of values for $\phi_0$ such that proper timing is unaffected.

Lemma 43 Let $G = \langle V,E,d,w,\chi\rangle$ be a circuit, and let $\pi' = \langle\phi_0',\gamma_0,\phi_1',\gamma_1\rangle$ be a clocking
scheme such that $G$ is properly timed by $\pi'$. Then, for any $\Delta > 0$ and $\phi_0$ in the range
$\phi_0' - \Delta \le \phi_0 \le \phi_0' + 2\Delta$ such that $0 < \phi_0 < \pi - \gamma_0 - \gamma_1$, the circuit $G$ is properly timed by
the clocking scheme $\pi = \langle\phi_0,\gamma_0,\phi_1,\gamma_1\rangle$, where $\pi = \pi' + \Delta$ and $\phi_1 = \pi - \gamma_0 - \gamma_1 - \phi_0$.






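GR&T($G, \gamma_0, \gamma_1, \epsilon$)

 1  $d_{\max} \leftarrow \max_{v \in V} d(v)$
 2  $\gamma \leftarrow \max\{\gamma_0, \gamma_1\}$
 3  $\pi^+ \leftarrow$ period of clocking scheme returned by R&ST($G, \gamma, \epsilon$)
 4  $\phi_0^+ \leftarrow$ duty cycle of phase 0 in clocking scheme returned by R&ST($G, \gamma, \epsilon$)
 5  $\pi^- \leftarrow \max\{\pi^+/(2(1+\epsilon)), \gamma_0 + \gamma_1\}$
 6  $\delta \leftarrow \pi^-\epsilon/2$
 7  while $\pi^+ - \pi^- > \delta$
 8      do $\pi \leftarrow (\pi^+ + \pi^-)/2$
 9         $\phi_0 \leftarrow \min\{2\delta, (\pi - \gamma_0 - \gamma_1)/2\} - 3\delta$
10         feasible $\leftarrow$ FALSE
11         while $(\pi - \gamma_0 - \gamma_1) - \phi_0 > \delta$ and feasible = FALSE
12             do ▷ Begin linear search over $\phi_0$ for feasible $(\phi_0, \pi)$.
13                $\phi_0 \leftarrow \min\{\phi_0 + 3\delta, \pi - \gamma_0 - \gamma_1 - \delta\}$
14                $\phi_1 \leftarrow \pi - \gamma_0 - \gamma_1 - \phi_0$
15                if Retime($G, \langle\phi_0, \gamma_0, \phi_1, \gamma_1\rangle$) $\ne$ fail
16                   then ▷ Found feasible $(\phi_0, \pi)$.
17                        feasible $\leftarrow$ TRUE
18                        $\pi^+ \leftarrow \pi$
19                        $\phi_0^+ \leftarrow \phi_0$
20         if feasible = FALSE
21            then ▷ $\pi$ is not feasible for any $\phi_0$.
22                 $\pi^- \leftarrow \pi$
23  $\phi_1^+ \leftarrow \pi^+ - \gamma_0 - \gamma_1 - \phi_0^+$
24  $r \leftarrow$ Retime($G, \langle\phi_0^+, \gamma_0, \phi_1^+, \gamma_1\rangle$)
25  return $\langle\phi_0^+, \gamma_0, \phi_1^+, \gamma_1\rangle$ and $r$

Figure 2-14: Algorithm GR&T for solving the general retiming and tuning problem. The algorithm
takes as input a two-phase circuit $G$, gap widths $\gamma_0$ and $\gamma_1$, and a relative error $\epsilon > 0$. It
computes a retimed circuit $G_r$ and a clocking scheme $\pi$ such that $G_r$ is properly timed by $\pi$ and
the period of $\pi$ is at most $(1+\epsilon)$ times the period of any clocking scheme that can be achieved by
retiming $G$.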



Proof. Since $G$ is properly timed by $\pi'$, it satisfies Inequalities (2.2) and (2.3) in Lemma 14
for the clocking scheme $\pi'$. In order to prove that $G$ is properly timed by $\pi$, we shall show
that $G$ also satisfies the constraints in Lemma 14 for the clocking scheme $\pi$.

Inequality (2.2) implies that for every path $p: u \leadsto v$ in $G$ with $\chi(u) \ne \chi(v)$ and $\chi(u) = 0$,
we must have

$$d(p) \le \pi'\left(\frac{1+w(p)}{2}\right) + \phi_0'
       \le \pi\left(\frac{1+w(p)}{2}\right) - (\pi - \pi')\left(\frac{1+w(p)}{2}\right) + \Delta + \phi_0
       \le \pi\left(\frac{1+w(p)}{2}\right) + \phi_0 ,$$

since $\pi > \pi'$, $w(p) \ge 1$, and $\phi_0' - \Delta \le \phi_0$. When $\chi(u) = 1$, we can show in a similar way and
using the inequality $\phi_0 \le \phi_0' + 2\Delta$ that $G$ still satisfies the path constraints for $\pi$. Therefore,
$G$ satisfies Inequality (2.2) for $\pi$.

The constraints described by Inequality (2.3) depend on the length of the clock period
and not on the duty cycles of the particular phases. By increasing the clock period from $\pi'$
to $\pi$ these constraints are still satisfied.

Since $G$ satisfies Inequalities (2.2) and (2.3) for $\pi$, Lemma 14 implies that it is properly
timed by $\pi$. □

[Figure 2-15: plot in the $(\phi_0, \pi)$-plane. The horizontal axis ($\phi_0$) marks $\phi_0' - \Delta$, $\phi_0'$, $\phi_0' + 2\Delta$,
and $\pi - \gamma_0 - \gamma_1$; the vertical axis marks $\pi'$, $\pi - \delta$, and $\pi$; the sampled points are spaced $3\delta$ apart.]

Figure 2-15: Illustration of the importance of Lemma 43 for the linear search in Algorithm GR&T.
The white rectangles denote the points of the $(\phi_0, \pi)$-plane that are visited during the linear search
with a specified $\pi$. If a circuit $G$ is properly timed by a clocking scheme $\langle\phi_0', \gamma_0, \phi_1', \gamma_1\rangle$, then it is
properly timed by every clocking scheme in the shaded area. The line with slope $-1$ is the steepest
among the constraints for $\pi'$ on paths $u \leadsto v$ with $\chi(u) = 0$ and $\chi(v) = 1$. The line with slope $1/2$ is
the steepest among the constraints for $\pi'$ on paths $u \leadsto v$ with $\chi(u) = 1$ and $\chi(v) = 0$. If $\pi' < \pi - \delta$,
then at least one of the white rectangles lies in the shaded area, and the circuit $G$ is properly timed
at that point.

Now, we can prove the correctness of Algorithm GR&T. 



Lemma 44 Algorithm GR&T computes a solution to the general retiming and tuning problem
whose relative error is at most $\epsilon$.

Proof. Algorithm GR&T maintains two invariants during its execution. We shall use
these invariants to show that Algorithm GR&T returns a clocking scheme $\pi$ such that
$\pi \le (1+\epsilon)\pi^*$, where $\pi^*$ is the minimum clock period that can be achieved by any retiming $G_r$.

The first property that remains invariant during the execution of Algorithm GR&T
is that there exists a retimed circuit $G_r$ that is properly timed by $\pi^+ = \langle\phi_0^+, \gamma_0, \phi_1^+, \gamma_1\rangle$.
Initially, the clocking scheme $\pi^+$ is essentially the symmetric scheme returned by Algorithm
R&ST, except that one of the gaps may be smaller. Shortening a gap does not affect the
proper timing of $G_r$, however, as long as the clock period remains the same. Subsequently,
the parameters of the clocking scheme $\pi^+$ are updated in lines 18-19, in which case the
algorithm has found a retiming $r$ such that $G_r$ is properly timed by the new clocking
scheme. Therefore, the invariant is maintained throughout the execution of the algorithm.

The second property that remains invariant during the execution of Algorithm GR&T
is that $\pi^- \le \pi^* + \delta$. Initially, $\pi^-$ assumes the value $\max\{\pi^+/(2(1+\epsilon)), \gamma_0 + \gamma_1\}$, where $\pi^+$ is
the period of the optimal symmetric clocking scheme within a relative error $\epsilon$. We can show
that this value is a lower bound for $\pi^*$ as follows. The term $\gamma_0 + \gamma_1$ is a lower bound for
$\pi^*$ by definition of the clock period. For the term $\pi^+/(2(1+\epsilon))$, if we let $\pi_s^*$ be the period of
the optimal symmetric clocking scheme with gaps $\gamma = \max\{\gamma_0, \gamma_1\}$, then since we initially
have $\pi^+ \le \pi_s^*(1+\epsilon)$, it follows that

$$\frac{\pi^+}{2(1+\epsilon)} \le \frac{\pi_s^*(1+\epsilon)}{2(1+\epsilon)} = \frac{\pi_s^*}{2} .$$

Moreover, we have

$$\frac{\pi_s^*}{2} \le \pi^* ,$$

since if $\pi_s^*/2 > \pi^*$, then by extending the shorter gap and the shorter duty cycle in $\pi^*$
to become equal to the longer gap and duty cycle respectively, we would obtain a feasible
symmetric clocking scheme with period at most $2\pi^* < \pi_s^*$, thus contradicting the optimality
of $\pi_s^*$. Therefore, we initially have $\pi^- \le \pi^* \le \pi^* + \delta$, and the invariant holds. During the
execution of the algorithm, $\pi^-$ is updated in line 22 whenever the linear search yields
no retimed circuit $G_r$ that achieves a clock period $\pi = (\pi^+ + \pi^-)/2$. The linear search is
performed in increments of $3\delta$, and if it fails to find any feasible point, Lemma 43 guarantees
that any feasible clocking scheme has period at least $\pi - \delta$. Consequently, the optimal clock
period $\pi^*$ must satisfy $\pi^* \ge \pi - \delta$, and since the new value of $\pi^-$ is $\pi$, it follows that
$\pi^* + \delta \ge \pi^-$, and the invariant still holds.

The boundary conditions of the linear search are established in lines 9 and 13 of the
code. The duty cycle $\phi_0$ is initialized to a value such that the predicate in line 11 is always
satisfied and the inner loop is executed at least once. In the first iteration of the inner loop,
we have $0 < \phi_0 \le 2\delta$ and $\phi_1 > 0$. The linear search proceeds in increments of $3\delta$, and it
continues until $\phi_0$ attains a value such that $(\pi - \gamma_0 - \gamma_1) - \delta \le \phi_0 < \pi - \gamma_0 - \gamma_1$.

Algorithm GR&T terminates when $\pi^+ - \pi^- \le \delta$ and returns a clock period $\pi$ that
satisfies

$$\pi \le \pi^+ \le \pi^- + \delta \le \pi^* + 2\delta \le \pi^* + \pi^*\epsilon = (1+\epsilon)\pi^* .$$

Therefore, Algorithm GR&T returns a clocking scheme whose period is within a relative
error of $\epsilon$ of the optimal period. □



Theorem 45 Algorithm GR&T can solve the general retiming and tuning problem for a
two-phase, level-clocked circuit $G = \langle V,E,d,w,\chi\rangle$ with relative error $\epsilon$ in
$O(V^3(1/\epsilon)\lg(1/\epsilon) + (VE + V^2 \lg V)\lg(V/\epsilon))$ time.



Proof. The correctness of Algorithm GR&T follows from Lemma 44. It remains to show
that it terminates in $O(V^3(1/\epsilon)\lg(1/\epsilon) + (VE + V^2 \lg V)\lg(V/\epsilon))$ steps.

Lines 1-4 of the algorithm complete in $O((VE + V^2 \lg V)\lg(V/\epsilon))$ time. The external
while loop (lines 7-22) performs a binary search in the interval $[\pi^-, \pi^+]$. The resolution of
the search is $\delta = \pi^-\epsilon/2$, and therefore, the number of potential periods is

$$\frac{\pi^+ - \pi^-}{\delta}
  \le \frac{\pi^+ - (\gamma_0 + \gamma_1)}{\pi^-\epsilon/2}
  \le \frac{\pi^+ - \gamma_0 - \gamma_1}{\left((\pi^+ - \gamma_0 - \gamma_1)/2(1+\epsilon)\right)(\epsilon/2)}
  = \frac{4(1+\epsilon)}{\epsilon}
  = O(1/\epsilon) .$$

Thus, the binary search causes lines 7-22 to be executed $O(\lg(1/\epsilon))$ times. The internal
while loop (lines 11-19) performs a linear search in the interval $(0, (\pi - \gamma_0 - \gamma_1)/2)$ in
increments of $3\delta$. In this case, the number of points checked is at most

$$\frac{(\pi - \gamma_0 - \gamma_1)/2}{3\left((\pi^+ - \gamma_0 - \gamma_1)/2(1+\epsilon)\right)(\epsilon/2)}
  \le \frac{4}{3}\left(\frac{1+\epsilon}{\epsilon}\right) .$$

Therefore, for each $\pi$, lines 11-19 are executed $O(1/\epsilon)$ times. Line 15 terminates in $O(V^3)$
time, and therefore, each iteration of lines 7-22 in Algorithm GR&T requires $O(V^3(1/\epsilon))$
time. Combining the time bounds, we conclude that Algorithm GR&T terminates in
$O(V^3(1/\epsilon)\lg(1/\epsilon) + (VE + V^2 \lg V)\lg(V/\epsilon))$ time. □

The practical efficiency of Algorithm GR&T can be improved by updating $\delta$ in the
then clause of line 22 immediately after $\pi^-$ has been updated with its new value. As $\pi^-$
increases, the step $\delta$ of the linear search grows, so the search becomes coarser and fewer
points are checked. The worst-case running time of the algorithm does not change, however.
Although the linear search is inefficient, we do not know how to replace it by a more efficient
binary search. The reason for this difficulty is that whenever a particular clock period is not
achievable for a specific $\phi_0$, we do not know how to tell whether the duty cycle of phase 0
should become shorter or longer.

PolyR&FDRT($G, \rho, \gamma_0, \gamma_1$)

1 Use Algorithm PiFDR to compute the set $\Pi(\rho)$ of possible periods.

2 Sort the elements in $\Pi(\rho)$.

3 Use Algorithm Retime to binary search the elements of $\Pi(\rho)$ for the minimum period $\pi$
that can be achieved by a retiming $r$.

4 return $\pi$ and $r$.

Figure 2-16: Algorithm PolyR&FDRT for the retiming and fixed-duty-ratio tuning problem.
The algorithm takes as input a circuit $G = \langle V,E,d,w,\chi\rangle$, a duty ratio $\rho$, and two gap widths $\gamma_0$ and
$\gamma_1$. It computes a retimed circuit $G_r$ and a period $\pi$ such that $G_r$ is properly timed by a clocking
scheme whose period is $\pi$, whose gap widths are $\gamma_0, \gamma_1$, and whose duty ratio is $\rho$. The period $\pi$
is guaranteed to be the minimum period over all clocking schemes that can be achieved by $G$ after
retiming and tuning with fixed duty ratio $\rho$.

2.10 Polynomial-Time Algorithms for Minimum-Period Retiming

In this section we present polynomial-time algorithms for three minimum-period retiming
problems. We have already investigated these three problems in Section 2.9, but the algorithms
we described in that section were fully polynomial-time approximation schemes. The
polynomial-time algorithms in this section compute exact solutions to these problems. We
first give an $O(V^2 E + V^3 \lg V)$-time algorithm for the retiming and fixed-duty-ratio tuning
problem. We adapt this algorithm to solve the retiming and symmetric tuning problem
in $O(V^2 E)$ time. Finally, we give an $O(V^{11})$-time algorithm for the general retiming and
tuning problem. This algorithm is interesting only from a theoretical perspective, however,
because the degree of the polynomial in its running time renders it impractical for large
circuits.

2.10.1 Retiming and Fixed-Duty-Ratio Tuning 

Algorithm PolyR&FDRT, shown in Figure 2-16, solves the retiming and fixed-duty-ratio
tuning problem by binary searching a set $\Pi(\rho)$ of $O(V^3)$ possible clock periods. The
set $\Pi(\rho)$ is guaranteed to contain the minimum clock period that can be achieved by $G$
after retiming and tuning with fixed duty ratio $\rho$. Step 1 computes $\Pi(\rho)$ using Algorithm
PiFDR, which is described in Subsection 2.10.5. Step 2 sorts the elements in $\Pi(\rho)$, and
Step 3 performs a binary search over the potential periods in order to identify the optimal one.
The binary search of $\Pi(\rho)$ is possible, because if a clock period $\pi$ cannot be achieved by
retiming, then no clock period $\pi' < \pi$ can be achieved, for a fixed duty ratio $\rho$. Algorithm
Retime is used as a subroutine in the search to test whether a potential clock period can
be achieved by retiming.
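Steps 2 and 3 amount to a standard binary search over a sorted candidate list. The Python sketch below illustrates this under stated assumptions: `candidates` plays the role of $\Pi(\rho)$, `feasible` is a hypothetical stand-in for a call to Algorithm Retime, and the function name is invented for this example.

```python
from typing import Callable, Iterable, Optional


def min_feasible_candidate(
    candidates: Iterable[float],
    feasible: Callable[[float], bool],
) -> Optional[float]:
    """Return the smallest candidate period accepted by a monotone oracle."""
    periods = sorted(set(candidates))  # Step 2: sort the candidate periods
    lo, hi = 0, len(periods)           # the answer index lies in [lo, hi]
    while lo < hi:                     # Step 3: binary search
        mid = (lo + hi) // 2
        if feasible(periods[mid]):
            hi = mid                   # periods[mid] works; try smaller ones
        else:
            lo = mid + 1               # everything up to mid is infeasible
    return periods[lo] if lo < len(periods) else None


# Toy oracle: pretend every period of at least 6.5 is achievable.
best = min_feasible_candidate([9, 4, 12, 7, 15, 4], lambda p: p >= 6.5)
none_found = min_feasible_candidate([1, 2, 3], lambda p: False)
```

The oracle is called only a logarithmic number of times, so the sorting step and the cost of each oracle call dominate the running time.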

Algorithm PolyR&FDRT runs in $O(V^2 E + V^3 \lg V)$ time. The computation of $\Pi(\rho)$ by
Algorithm PiFDR can be performed in $O(V^2 E)$ time, as shown in Subsection 2.10.5.
The $O(V^3)$ elements of $\Pi(\rho)$ can be sorted in $O(V^3 \lg V)$ time. The binary search takes
$O(V^3 \lg V)$ time, since Algorithm Retime runs in $O(V^3)$ time. Therefore, Algorithm
PolyR&FDRT terminates in $O(V^2 E + V^3 \lg V)$ steps.

PolyR&ST($G, \gamma$)

1 Use Algorithm PiFDR to compute the set $\Pi(\rho)$ of possible periods.

2 Use Algorithm RwSCS to perform a median-based binary search of the elements in $\Pi(\rho)$
for the minimum period $\pi$ that can be achieved by a retiming $r$.

3 return $\pi$ and $r$.

Figure 2-17: Algorithm PolyR&ST for the retiming and symmetric tuning problem. The algorithm
takes as input a circuit $G = \langle V,E,d,w,\chi\rangle$ and a gap width $\gamma$. It computes a retimed circuit
$G_r$ and a period $\pi$ such that $G_r$ is properly timed by a symmetric clocking scheme whose period
is $\pi$, and whose gap widths are $\gamma$. The period $\pi$ is guaranteed to be the minimum period over all
symmetric clocking schemes that can be achieved by $G$ under retiming.

2.10.2 Retiming and Symmetric Tuning 

The retiming and symmetric tuning problem is a special case of the retiming and
fixed-duty-ratio problem in which only symmetric clocking schemes are considered. This problem
can be solved in $O(V^2 E)$ time using Algorithm PolyR&ST shown in Figure 2-17.

Algorithm PolyR&ST takes as arguments the circuit $G$ and a value $\gamma$ for the gaps
between the two phases. The computation of the set $\Pi(\rho)$ in Step 1 is identical to the
computation in Step 1 of Algorithm PolyR&FDRT. In Step 2, Algorithm RwSCS is used
as a subroutine to binary search $\Pi(\rho)$ for the minimum period that can be achieved by
retiming $G$. Contrary to Algorithm PolyR&FDRT, however, the elements of $\Pi(\rho)$ are not
sorted, because the $O(V^3 \lg V)$ time required to sort its $O(V^3)$ elements would not allow us
to achieve the $O(V^2 E)$ running time. Instead, the binary search is performed by computing
the median of the periods still under consideration at each iteration of the search.
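A median-based search over an unsorted candidate set can be sketched as follows. This is an illustration under assumptions, not the thesis code: `feasible` is a hypothetical stand-in for Algorithm RwSCS, and the selection routine is a randomized quickselect, which runs in expected linear time, swapped in for the deterministic linear-time median algorithm that the analysis relies on.

```python
import random
from typing import Callable, List, Optional, Sequence


def select(values: List[float], k: int) -> float:
    """Return the k-th smallest element (0-based) by randomized quickselect."""
    while True:
        pivot = random.choice(values)
        lower = [x for x in values if x < pivot]
        equal = [x for x in values if x == pivot]
        if k < len(lower):
            values = lower
        elif k < len(lower) + len(equal):
            return pivot
        else:
            k -= len(lower) + len(equal)
            values = [x for x in values if x > pivot]


def min_feasible_unsorted(
    candidates: Sequence[float],
    feasible: Callable[[float], bool],
) -> Optional[float]:
    """Find the smallest feasible candidate without sorting the set."""
    remaining = list(candidates)
    best = None
    while remaining:
        median = select(remaining, (len(remaining) - 1) // 2)
        if feasible(median):
            best = median  # feasible: only smaller candidates can improve
            remaining = [x for x in remaining if x < median]
        else:              # infeasible: only larger candidates can work
            remaining = [x for x in remaining if x > median]
    return best


# Toy oracle: pretend every period of at least 6.5 is achievable.
best = min_feasible_unsorted([9, 4, 12, 7, 15, 4], lambda p: p >= 6.5)
none_found = min_feasible_unsorted([1, 2, 3], lambda p: False)
```

Each round discards at least half of the remaining candidates, so the selection work forms a geometric series while the oracle is still called only a logarithmic number of times.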

Algorithm PolyR&ST runs in $O(V^2 E)$ time. The computation of $\Pi(\rho)$ in Step 1
requires $O(V^2 E)$ time. Each iteration of the binary search in Step 2 requires
$O(VE + V^2 \lg V)$ time. Since the median of a set of $n$ elements can be found in $O(n)$ time [2, 10,
18], and since the binary search halves the number of periods under consideration at each
iteration, Step 2 completes in

$$\sum_{i=0}^{O(\lg V)} \left( O\!\left(\frac{V^3}{2^i}\right) + O(VE + V^2 \lg V) \right)
  = O(V^3 + (VE + V^2 \lg V)\lg V)
  = O(V^3 + VE \lg V)$$

steps. Therefore, Algorithm PolyR&ST terminates in $O(V^2 E) + O(V^3 + VE \lg V) = O(V^2 E)$ time.

2.10.3 Retiming and Tuning 

Algorithm PolyR&T, shown in Figure 2-18, solves the retiming and tuning problem
using a linear search over a space $\Pi$ of possible duty-cycle/period pairs. Step 1 employs
Algorithm Pi to compute a set $\Pi$ of $O(V^8)$ pairs $(\phi_0, \pi)$ that is guaranteed to contain a
pair corresponding to an optimal clocking scheme that can be achieved by retiming $G$.
Algorithm Pi is described in Subsection 2.10.4. Step 2 performs a linear search over the
elements of $\Pi$ in order to identify the optimal one. Algorithm Retime is used as a subroutine
in the search to test whether a potential clocking scheme can be achieved by retiming. Even
though this linear search is inefficient, we do not know of a way to replace it by a binary
search, because if a period $\pi$ is not achievable we do not know how to tell whether $\phi_0$
should be made longer or shorter.

PolyR&T($G, \gamma_0, \gamma_1$)

1 Use Algorithm Pi to compute the set $\Pi$ of possible $(\phi_0, \pi)$ pairs.

2 Use Algorithm Retime to search the elements in $\Pi$ for the minimum period $\pi$
that can be achieved by a retiming $r$.

3 return $\pi$ and $r$.

Figure 2-18: Algorithm PolyR&T for the retiming and tuning problem. The algorithm takes as
input a circuit $G = \langle V,E,d,w,\chi\rangle$ and gap widths $\gamma_0, \gamma_1$. It computes a retimed circuit $G_r$ and a
period $\pi$ such that $G_r$ is properly timed by a clocking scheme whose period is $\pi$, and whose gap
widths are $\gamma_0, \gamma_1$. The period $\pi$ is guaranteed to be the minimum period over all clocking schemes
that can be achieved by $G$ under retiming.

Algorithm PolyR&T runs in $O(V^{11})$ time. The computation of $\Pi$ by Algorithm Pi can
be performed in $O(V^8)$ time, as shown in Subsection 2.10.4. The linear search takes
$O(V^{11})$ time, since Algorithm Retime runs in $O(V^3)$ time, and there are $O(V^8)$ elements
in $\Pi$. Therefore, Algorithm PolyR&T terminates in $O(V^{11})$ steps.

2.10.4 Possible Periods for Retiming and Tuning 

In this section we describe Algorithm Pi, which constructs the set $\Pi$ of pairs $(\phi_0, \pi)$ that is used
by Algorithm PolyR&T. We first explain the derivation of $\Pi$ based on the conditions for
proper timing that must hold for every retiming of the original circuit $G$. Then, we describe
the operation of Algorithm Pi, and we argue that it terminates in $O(V^8)$ time. Finally, we
prove its correctness by showing that the pair $(\phi_0, \pi)$ corresponding to the minimum-period
clocking scheme that can be achieved by retiming $G$ is included among the $O(V^8)$ elements
of $\Pi$.

The construction of $\Pi$ is based on Lemma 18, which states that a given two-phase,
level-clocked circuit $G$ is properly timed by a given clocking scheme $\pi$ if and only if Inequalities
(2.6), (2.7), and (2.8) hold. The set $\Pi$ is obtained by considering Inequalities (2.6), (2.7),
and (2.8) for every possible retiming of $G$. The intersections of the lines defined by all these
inequalities are guaranteed to include all optimal pairs $(\phi_0, \pi)$. This approach is similar to
the approach taken for the clock tuning problem. There are two differences, however. First,
the number of intersections increases when tuning is combined with retiming, because the
relocation of latches introduces new constraints. Second, the intersections formed between
constraints that correspond to different retimings are redundant. Every optimal point that
can be achieved for some retiming, however, is included in $\Pi$.
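To make the notion of candidate points concrete, the following Python sketch enumerates the pairwise intersections of a small set of lines in the $(\phi_0, \pi)$-plane. The representation of a line as a coefficient triple, the function name, and the three sample lines are assumptions introduced purely for illustration; the sketch only shows why a set of $n$ lines yields on the order of $n^2$ candidate points.

```python
from itertools import combinations
from typing import List, Tuple

# A line a*phi0 + b*pi = c is stored as the triple (a, b, c).
Line = Tuple[float, float, float]


def intersections(lines: List[Line], tol: float = 1e-12) -> List[Tuple[float, float]]:
    """Return the (phi0, pi) intersection point of every non-parallel pair."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) <= tol:  # parallel or identical lines: no single point
            continue
        phi0 = (c1 * b2 - c2 * b1) / det  # Cramer's rule
        pi = (a1 * c2 - a2 * c1) / det
        points.append((phi0, pi))
    return points


# Sample lines: slope -1, slope 1/2, and a horizontal lower bound on pi.
sample = [(1.0, 1.0, 13.0), (-0.5, 1.0, 8.5), (0.0, 1.0, 4.0)]
points = intersections(sample)
```

The first two sample lines meet at $(\phi_0, \pi) = (3, 10)$, and three pairwise non-parallel lines give three candidate points in total.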

Algorithm Pi is shown in Figure 2-19. For every constraint that is significant for the
proper timing of some retiming of $G$, the set $C$ holds its corresponding line on the $(\phi_0, \pi)$-plane.
In Step 2 the set $C$ is augmented by the single lower bound on $\pi$ that holds for every
retiming of $G$ and is given in Inequality (2.16). Step 3 is similar to Step 2 in Algorithm TV.

Pi($G, \gamma_0, \gamma_1$)

1. Initialize $C$ to be the empty set of lines on the $(\phi_0, \pi)$-plane.

2. Augment $C$ by the line corresponding to the single lower bound on the clock period $\pi$
that is defined in Lemma 27.

3. For each $u \in V$, compute $DD(u,v,i)$ for all $v \in V$ and $i = 0, 1, \ldots, 3|V| - 3$, from
the recurrence

$$DD(u,v,i) = d(v) + \max\left\{ DD(u,x,i - w(e)) : x \xrightarrow{e} v \text{ and } i \ge w(e) \right\} .$$

4. For each $u, v \in V$, and for $0 \le i \le 6|V| - 2$, $0 \le i + l \le 3|V| - 3$, $j = 0, 1$, augment $C$
by the line corresponding to the following constraint:

$$\pi \ge \frac{DD(u,v,i) - \phi_j}{(1 + i + l)/2} \quad \text{if } i + l \text{ is odd};$$

$$\pi \ge \frac{DD(u,v,i) - \gamma_j}{(2 + i + l)/2} \quad \text{if } i + l \text{ is even}.$$

5. Return $\Pi = \{(\phi_0, \pi) : (\phi_0, \pi)$ is an intersection between two lines in $C\}$.

Figure 2-19: Algorithm Pi takes a circuit $G$, and two gap widths $\gamma_0$ and $\gamma_1$. The algorithm runs in
$O(V^8)$ time and computes a set $\Pi$ of $O(V^8)$ pairs $(\phi_0, \pi)$ that include the minimum-period clocking
scheme that can be achieved by retiming $G$.

First, we compute a topological sort of all edges $e \in E$ with $w(e) = 0$. We then execute
a triply nested loop that computes the longest propagation delay between every pair of
vertices $u, v \in V$, for paths with up to $3|V| - 3$ latches. The outer loop is indexed by $u$,
the middle loop is indexed by $i$, and the inner loop is indexed by each $e \in E$ consistent
with the topological sort order if $w(e) = 0$ and in any order if $w(e) > 0$. Step 4 computes
$O(V^2)$ constraints for each pair of vertices $u, v \in V$. Finally, Step 5 computes and returns
the $O(V^8)$ intersections among the $O(V^4)$ constraints.
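The recurrence of Step 3 can be sketched in Python for a single source vertex $u$. This is a minimal sketch under assumptions, not the thesis implementation: the edge-list representation, the function name `longest_delays`, the base case $DD(u,u,0) = d(u)$, and the toy circuit are all introduced here for illustration, and the zero-weight edges are assumed to form no cycle. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Hashable, List, Tuple

NEG = float("-inf")  # marks "no path with this many latches"


def longest_delays(
    vertices: List[Hashable],
    edges: List[Tuple[Hashable, Hashable, int]],  # (x, v, w(e))
    d: Dict[Hashable, float],
    u: Hashable,
    max_latches: int,
) -> List[Dict[Hashable, float]]:
    """Return DD with DD[i][v] = longest delay of a u-to-v path with i latches."""
    # Topological order of the zero-weight subgraph (assumed acyclic).
    sorter = TopologicalSorter({v: set() for v in vertices})
    incoming = {v: [] for v in vertices}
    for x, v, w in edges:
        incoming[v].append((x, w))
        if w == 0:
            sorter.add(v, x)  # x must be processed before v
    order = list(sorter.static_order())

    DD = [{v: NEG for v in vertices} for _ in range(max_latches + 1)]
    for i in range(max_latches + 1):
        for v in order:
            # The empty path from u to itself seeds the recurrence.
            best = 0.0 if (v == u and i == 0) else NEG
            for x, w in incoming[v]:
                if w <= i and DD[i - w][x] > best:
                    best = DD[i - w][x]
            if best > NEG:
                DD[i][v] = d[v] + best
    return DD


# Toy circuit: a -> b carries no latch; b -> c and a -> c carry one latch each.
dd = longest_delays(
    ["a", "b", "c"],
    [("a", "b", 0), ("b", "c", 1), ("a", "c", 1)],
    {"a": 1, "b": 2, "c": 3},
    "a",
    1,
)
```

For a fixed source the loop touches every edge once per latch count, which mirrors the $O(VE)$ work per source vertex and the $O(V^2 E)$ total described above.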

We now prove a bound on the running time of Algorithm Pi. 

Lemma 46 Algorithm Pi terminates in $O(V^8)$ time.

Proof. The initialization of Step 1 is computed in $O(1)$ time. Step 2 terminates in $O(VE)$
time according to Lemma 27. In Step 3, the topological sort requires $O(E)$ time. In the
triply nested loop, $u$ takes on $O(V)$ values, $i$ takes on $O(V)$ values, and $e$ takes on $O(E)$
values. Thus, the total number of steps for Step 3 is $O(V^2 E)$. Step 4 requires $O(V^4)$ time.
Finally, the $O(V^8)$ intersections in Step 5 are computed in $O(V^8)$ time. Thus, the total
running time of Algorithm Pi is $O(V^8)$. □

We prove the correctness of Algorithm Pi in two lemmas. 

Lemma 47 Let $G = \langle V,E,d,w,\chi\rangle$ be a two-phase, level-clocked circuit. Then Inequality (2.8)
can be reduced to a single constraint that holds for any retimed circuit $G_r$.

Proof. According to Lemma 27, the constraints defined by Inequality (2.8) can be reduced
to the single lower bound given by Inequality (2.16), for any given retiming of the original
circuit $G$. Moreover, these constraints remain unaltered for every retiming of $G$, because
retiming does not change the propagation delay or the number of latches around any cycle
in the circuit. Therefore, Inequality (2.8) can be reduced to a single constraint that holds
for any retiming of $G$. □



Lemma 48 Let $G = \langle V,E,d,w,\chi\rangle$ be a two-phase, level-clocked circuit. Then Inequalities
(2.6) and (2.7) lead to $O(V^4)$ constraints that are significant for the proper timing of any
retimed circuit $G_r$.



Proof. We shall first prove the lemma for Inequality (2.6). Consider a fixed retiming $G_r$
of the original circuit $G$ and a pair of vertices $u, v \in V$ with $\chi_r(u) = 0$ and $\chi_r(v) = 1$; the
situation with $\chi_r(u) = 1$ and $\chi_r(v) = 0$ is similar. Inequality (2.6) applies in this case, and
for every simple path $p: u \leadsto v$ in $G_r$, we have

$$d(p) \le \pi\left(\frac{1 + w(p) + r(v) - r(u)}{2}\right) + \phi_0 . \tag{2.33}$$

For a fixed $w(p)$, there may exist an exponential number of such constraints, since the
number of simple paths $u \leadsto v$ with initially $w(p)$ latches on them may be exponential. We
can reduce these exponentially many constraints down to a single constraint by considering
the simple path $u \leadsto v$ with the longest delay. Therefore, all simple paths $u \leadsto v$ in $G_r$ with
$w(p) = i$ must satisfy the single tightest constraint

$$DD(u,v,i) \le \pi\left(\frac{1 + w(p) + r(v) - r(u)}{2}\right) + \phi_0 , \tag{2.34}$$
where the longest path delay $DD(u,v,i)$ is defined as

$$ DD(u,v,i) = \max \{ d(p) : p \text{ is a path from } u \text{ to } v \text{, and } w(p) = i \} , \tag{2.35} $$

for every pair of vertices $u, v \in V$. Note that the longest propagation delay $DD(u,v,i)$
is computed over all paths $u \stackrel{p}{\leadsto} v$ in $G_r$ with $w(p) = i$, instead of just over the simple
paths between the two vertices. We use this definition of $DD(u,v,i)$ for efficiency reasons:
computing $DD(u,v,i)$ as defined requires $O(V^2 E)$ steps, whereas computing $DD(u,v,i)$
over simple paths is an intractable problem, since even finding a simple path with any given
number of edges is intractable [13]. It is straightforward to verify, however, that no incorrect
or redundant constraints are generated when $DD(u,v,i)$ is defined as in Equation (2.35).
If $DD(u,v,i)$ is attained for a simple path, then Inequality (2.34) is one of the constraints
defined by Inequality (2.33). If $DD(u,v,i)$ is attained for a path $p$ that is not simple,
however, then we can show that Inequality (2.34) can also be derived from a set of simple
paths and simple cycles. Let $p$ consist of a simple path $u \stackrel{p'}{\leadsto} v$ and simple cycles $c_1, \ldots, c_n$,
for some $n$. If the circuit is properly timed, then from Inequality (2.6) we have

$$ d(p') \le \pi \left( \frac{1 + w(p') + r(v) - r(u)}{2} \right) + \phi_0 , $$

and from Inequality (2.8) we have

$$ d(c_i) \le \pi \left( \frac{w(c_i)}{2} \right) $$

for $i = 1, \ldots, n$. Inequality (2.34) follows immediately by adding these inequalities term by term,
and therefore, it is not a redundant or incorrect constraint.

We now show that for a given pair of vertices $u, v \in V$ there are $O(V^2)$ constraints that
determine the clock period for any retiming of the original circuit and for any initial number
of latches on any path $u \leadsto v$. According to Corollary 17, every simple path with more
than $3|V| - 3$ latches does not determine the clock period of a two-phase circuit. Moreover,
Lemma 39 states that for every circuit G that can be retimed to achieve a specific clock
period, there exists a minimum retiming $r$ that assigns integers in the interval $[0, 3|V| - 1]$.
Therefore, every path with more than $6|V| - 2$ latches initially can be ignored, and every
retiming that leaves more than $3|V| - 3$ latches on a path can be ignored. Thus, after
rearranging, for each pair of vertices $u, v \in V$ we are left with the $O(V^2)$ constraints

$$ \pi \ge \frac{DD(u,v,i) - \phi_0}{(1 + i + r(v) - r(u))/2} , \tag{2.36} $$

where $0 \le i \le 6|V| - 2$, $0 \le i + r(v) - r(u) \le 3|V| - 3$, and $i + r(v) - r(u)$ is odd. Since
there are $O(V^2)$ such pairs, we have a total of $O(V^4)$ constraints.

The constraints corresponding to Inequality (2.7) can be handled in a similar way. For
each pair of vertices $u, v \in V$ with $\chi_r(u) = \chi_r(v) = 0$, we have the $O(V)$ constraints

$$ \pi \ge \max_i \frac{DD(u,v,i) - \gamma_1}{(2 + i + r(v) - r(u))/2} , \tag{2.37} $$

where $0 \le i \le 6|V| - 2$, $0 \le i + r(v) - r(u) \le 3|V| - 3$, and $i + r(v) - r(u)$ is even. The
situation with $\chi_r(u) = \chi_r(v) = 1$ is similar. Since there are $O(V^2)$ such pairs, we have a
total of $O(V^3)$ constraints. Therefore, the total number of constraints for the proper timing
of any retimed circuit $G_r$ is $O(V^4)$. □

Thus, we obtain the following theorem. 

Theorem 49 In 0(V S ) time, Algorithm Pi correctly computes a set II of 0(V S ) pairs 
$(\phi_0, \pi)$ that includes the minimum-period clocking schemes that can be achieved by a retiming
of G.

Proof. The running time of the algorithm follows directly from Lemma 46. The correctness 
of the algorithm follows directly from Lemmas 47 and 48. □ 

2.10.5 Possible Periods for Retiming and Fixed-Duty-Ratio Tuning 

In this section we describe the $O(V^2 E)$-time Algorithm PiFDR that is employed by
Algorithms PolyR&FDRT and PolyR&ST to compute the set $\Pi(\rho)$. This set has $O(V^3)$
elements, which include the minimum clock period that can be achieved by retiming G and
simultaneously tuning its clocking scheme with a fixed duty-ratio $\rho$. We show that when the
duty-ratio is fixed, there are at most $O(V^3)$ constraints that determine the clock period of
any retiming of G. We then show how to compute the clock periods corresponding to these
constraints in $O(V^3)$ time, and we argue that Algorithm PiFDR terminates in $O(V^2 E)$
steps.

PiFDR$(G, \rho, \gamma_0, \gamma_1)$

1. Initialize $\Pi(\rho)$ to be the empty set.

2. Augment $\Pi(\rho)$ by the single lower bound on the clock period $\pi$ that is defined in
Lemma 27.

3. For each $u, v \in V$ compute

$$ i_{\min}(u,v) = \min \{ w(p) : u \stackrel{p}{\leadsto} v \text{ in } G \} . $$

4. For each $u \in V$, compute $DD(u,v,i)$ for all $v \in V$ and $i = 0, 1, \ldots, 3|V| - 3$, from
the recurrence

$$ DD(u,v,i) = d(v) + \max \{ DD(u,x,i - w(e)) : x \stackrel{e}{\to} v \text{ and } i \ge w(e) \} . $$

5. Use Algorithm OddCount for each $u, v \in V$ and for $j = 0, 1$, to augment $\Pi(\rho)$ by
the lower bound on the clock period $\pi$ that is defined by the following constraint:

$$ \pi \ge \max_i \frac{(\rho^{(-1)^j} + 1)\,DD(u,v,i) + \gamma_0 + \gamma_1}{(\rho^{(-1)^j} + 1)(1 + i + l)/2 + 1} , $$

where $-i_{\min}(u,v) \le l \le 3|V| - 1$, $i_{\min}(u,v) \le i \le 3|V| - 3 - l$, and $i + l$ is odd.

6. Use Algorithm EvenCount for each $u, v \in V$ and for $j = 0, 1$, to augment $\Pi(\rho)$ by
the lower bound on the clock period $\pi$ that is defined by the following constraint:

$$ \pi \ge \max_i \frac{DD(u,v,i) - \gamma_{1-j}}{(2 + i + l)/2} , $$

where $-i_{\min}(u,v) \le l \le 3|V| - 1$, $i_{\min}(u,v) \le i \le 3|V| - 3 - l$, and $i + l$ is even.

7. Return $\Pi(\rho)$.

Figure 2-20: Algorithm PiFDR takes a circuit G, a duty-ratio $\rho$, and two gap widths $\gamma_0$ and $\gamma_1$.
The algorithm runs in $O(V^2 E)$ time and computes a set $\Pi(\rho)$ of $O(V^3)$ pairs $(\phi_0, \pi)$ that describe
clocking schemes with duty-ratio $\rho$. The set $\Pi(\rho)$ includes the minimum-period clocking scheme
with duty-ratio $\rho$ that can be achieved by retiming G.

Algorithm PiFDR, shown in Figure 2-20, relies on the observation that in retiming and
fixed-duty-ratio tuning the optimal clocking scheme is determined by the intersection on
the $(\phi_0, \pi)$-plane of Equation (2.1)

$$ \pi = \phi_0 + \gamma_0 + \phi_1 + \gamma_1 = (1 + \rho)\phi_0 + \gamma_0 + \gamma_1 $$

with one of the lines defined by Inequalities (2.6), (2.7), and (2.8). Step 2 computes in
$O(VE)$ time the single intersection of Equation (2.1) with the line defined by Inequality (2.8).
For every pair of vertices $u, v \in V$, Step 3 computes the latch count $i_{\min}(u,v)$ of
the path $u \leadsto v$ in G with the fewest latches. The maximum propagation delays $DD(u,v,i)$
are computed in Step 4 for every pair of vertices $u, v \in V$ and for $i = 0, 1, \ldots, 3|V| - 3$.
Step 5 computes $O(V^3)$ intersections of Equation (2.1) with the lines defined by Inequality (2.6).
For each of the $O(V^2)$ pairs $u, v \in V$, this step employs the $O(V)$-time Algorithm
OddCount, shown in Figure 2-21, to compute $O(V)$ clock periods that are generated when
every path from $u$ to $v$ has an odd number of latches after retiming. Algorithm OddCount
is invoked twice in this step, once for each possible phase of $u$ after retiming. Step 6 is similar
to Step 5. In this step, Algorithm EvenCount is employed to compute $O(V^3)$ clock
periods from Inequality (2.7) for paths with an even number of latches after retiming.
The operation of Algorithm EvenCount is almost identical to that of Algorithm OddCount,
the only difference being that Algorithm EvenCount operates with respect to
Inequality (2.7) instead of Inequality (2.6).
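
The recurrence of Step 4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in Tim: the function name `longest_delays_from`, the edge-list representation `(x, v, w)`, and the use of a topological order over zero-latch edges are choices made for this example. The topological order exists because the circuit is assumed to contain no latch-free cycles. Entries for which no path with exactly `i` latches exists are left at minus infinity.

```python
from collections import defaultdict, deque

NEG_INF = float("-inf")

def longest_delays_from(source, vertices, edges, delay, max_latches):
    """Return dd[v][i]: longest delay of a path source ~> v with exactly i latches.

    edges is a list of (x, v, w) triples, where w is the latch count of edge x -> v.
    delay[v] is the propagation delay d(v) of vertex v.
    """
    zero_succ = defaultdict(list)      # successors through latch-free edges
    indegree = {v: 0 for v in vertices}
    incoming = defaultdict(list)       # incoming[v] = [(x, w), ...]
    for x, v, w in edges:
        incoming[v].append((x, w))
        if w == 0:
            zero_succ[x].append(v)
            indegree[v] += 1

    # Topological order of the latch-free subgraph (Kahn's algorithm).
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        x = queue.popleft()
        order.append(x)
        for v in zero_succ[x]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("circuit contains a latch-free cycle")

    dd = {v: [NEG_INF] * (max_latches + 1) for v in vertices}
    for i in range(max_latches + 1):
        for v in order:
            # The trivial path consisting of the source alone has zero latches.
            best = 0.0 if (v == source and i == 0) else NEG_INF
            for x, w in incoming[v]:
                if i >= w:
                    best = max(best, dd[x][i - w])
            if best > NEG_INF:
                dd[v][i] = delay[v] + best
    return dd
```

For one source vertex the double loop touches every edge once per latch count, so with `max_latches` set to $3|V| - 3$ the work per source is $O(VE)$, which is in line with the $O(V^2 E)$ bound quoted for Step 4 over all sources.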

The correctness of Algorithm PiFDR is proved in the following lemma. 

Lemma 50 Let $G = (V, E, d, w, \chi)$ be a two-phase, level-clocked circuit. Then Inequalities
(2.6), (2.7), and (2.8) lead to $O(V^3)$ constraints that are significant for the proper timing
of any retimed circuit $G_r$ with a clocking scheme of fixed duty-ratio $\rho$.

Proof. We shall prove the lemma for Inequality (2.6). The proof for Inequalities (2.7)
and (2.8) is similar. Consider a retiming $G_r$ of G and a pair of vertices $u, v \in V$ with
$\chi_r(u) = 0$ and $\chi_r(v) = 1$. Inequality (2.6) applies in this case. Following the same steps as
in the proof of Lemma 48, we are left with the $O(V^2)$ constraints

$$ \pi \ge \frac{DD(u,v,i) - \phi_0}{(1 + i + r(v) - r(u))/2} , \tag{2.38} $$

where $0 \le i \le 6|V| - 2$, $0 \le i + r(v) - r(u) \le 3|V| - 3$, and $i + r(v) - r(u)$ is odd. From
Equation (2.1) and the definition of the duty-ratio $\rho$ we have

$$ \phi_0 = \pi - \gamma_0 - \phi_1 - \gamma_1 = \pi - \gamma_0 - \rho\phi_0 - \gamma_1 = \frac{\pi - \gamma_0 - \gamma_1}{\rho + 1} . $$

Thus, substitution of $\phi_0$ in Inequality (2.38) yields

$$ \pi \ge \frac{(\rho + 1)\,DD(u,v,i) + \gamma_0 + \gamma_1}{(\rho + 1)(1 + i + r(v) - r(u))/2 + 1} , $$

where $-i_{\min}(u,v) \le r(v) - r(u) \le 3|V| - 1$, $i_{\min}(u,v) \le i \le 3|V| - 3 - (r(v) - r(u))$, and
$i + r(v) - r(u)$ is odd. Therefore, for any fixed $l = r(v) - r(u)$ in the interval
$-i_{\min}(u,v) \le l \le 3|V| - 1$, there is a single constraint

$$ \pi \ge \max_i \frac{(\rho + 1)\,DD(u,v,i) + \gamma_0 + \gamma_1}{(\rho + 1)(1 + i + l)/2 + 1} , $$

where $i$ ranges over all integers such that $i_{\min}(u,v) \le i \le 3|V| - 3 - l$, and $i + l$ is odd.
Therefore, there are $O(V)$ potential periods corresponding to the pair of vertices $u, v \in V$.
Since there are $O(V^2)$ such pairs, the total number of constraints is $O(V^3)$. □
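
The substitution in the proof above can be checked numerically with a short Python sketch. The function names are hypothetical and chosen for this example only. The sketch assumes the case $\chi_r(u) = 0$, $\chi_r(v) = 1$ treated in the proof, with $\phi_1 = \rho\phi_0$ as the meaning of the duty-ratio.

```python
def fixed_duty_ratio_bound(dd, i, l, rho, gamma0, gamma1):
    """Lower bound on the period pi for one pair (i, l), with l = r(v) - r(u)."""
    numerator = (rho + 1) * dd + gamma0 + gamma1
    denominator = (rho + 1) * (1 + i + l) / 2 + 1
    return numerator / denominator

def phi0_from_period(pi, rho, gamma0, gamma1):
    """Solve Equation (2.1), pi = (1 + rho) * phi0 + gamma0 + gamma1, for phi0."""
    return (pi - gamma0 - gamma1) / (rho + 1)

# Example values (arbitrary): DD(u,v,i) = 10, i = 1, l = 0, rho = 1, gaps of width 1.
pi = fixed_duty_ratio_bound(10.0, 1, 0, 1.0, 1.0, 1.0)
phi0 = phi0_from_period(pi, 1.0, 1.0, 1.0)
# At the bound, the path constraint pi * (1 + i + l) / 2 + phi0 >= DD holds with equality.
slack = pi * (1 + 1 + 0) / 2 + phi0 - 10.0
```

With these example values the slack is zero up to rounding, which is what one expects when the period sits exactly on the line defined by the constraint.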
OddCount$(G, u, v, j, DD, \gamma_0, \gamma_1, i_{\min}(u,v))$

1   Initialize $S$ and $S'$ to contain the line $\pi = 0$.
2   for $i = i_{\min}(u,v)$ to $3|V| - 3$
3       do $\pi_i(l) \leftarrow \dfrac{(\rho^{(-1)^j} + 1)\,DD(u,v,i) + \gamma_0 + \gamma_1}{(\rho^{(-1)^j} + 1)(1 + i + l)/2 + 1}$
4   if $i_{\min}(u,v) \bmod 2 = 0$        ▷ computation for odd values of $l$
5       then $b \leftarrow i_{\min}(u,v)$
6       else $b \leftarrow i_{\min}(u,v) - 1$
7   for $i = 3|V| - 3$ downto $i_{\min}(u,v)$ with even $i$
8       do if $\pi_i(-b) > S_{top}(-b)$
9           then while $\pi_i(l) \cap S_{top}(l) < S_{top}(l) \cap S_{top-1}(l)$ and $S \ne$ Empty
10                  do Pop($S$)
11               Push($\pi_i$, $S$)
12  if $i_{\min}(u,v) \bmod 2 = 1$        ▷ computation for even values of $l$
13      then $b \leftarrow i_{\min}(u,v)$
14      else $b \leftarrow i_{\min}(u,v) - 1$
15  for $i = 3|V| - 3$ downto $i_{\min}(u,v)$ with odd $i$
16      do if $\pi_i(-b) > S'_{top}(-b)$
17          then while $\pi_i(l) \cap S'_{top}(l) < S'_{top}(l) \cap S'_{top-1}(l)$ and $S' \ne$ Empty
18                  do Pop($S'$)
19               Push($\pi_i$, $S'$)
20  return $S$ and $S'$.

Figure 2-21: Algorithm OddCount takes a circuit G, a function DD, gap widths $\gamma_0$ and $\gamma_1$, a
pair of vertices $u$ and $v$, the phase $j$ of $u$ after retiming, and the minimum initial latch count of any
path from $u$ to $v$ in G. The algorithm runs in $O(V)$ time and computes the subset of $\Pi(\rho)$ that is
generated by an odd number of latches on any path between $u$ and $v$ after retiming.



The $O(V)$-time Algorithm OddCount that is employed by Algorithm PiFDR to compute
the clock periods due to odd latch counts after retiming is shown in Figure 2-21. This
algorithm computes the boundary curves of the region on the $(l, \pi)$-plane that is defined
by the maximization in Step 5 of Algorithm PiFDR. The boundary curves are functions of
the variable $l$, and they are returned in two stacks $S$ and $S'$. Stack $S$ contains the curves
for odd values of $l$, and stack $S'$ contains the curves for even values of $l$. We shall describe
the computation of $S$ in lines 4-11; the computation of $S'$ in lines 12-19 is similar. The for
statement of lines 7-11 scans the curves that correspond to even values of $i$. The scan starts
from the maximum value of $i$ and proceeds towards its minimum value. At the beginning
of the iteration for curve $\pi_i$, the boundary curves among $\pi_{3|V|-3}, \ldots, \pi_{i+1}$ are already on
$S$. While $\pi_i$ lies above the leftmost vertex of the region that is defined by the curves on $S$,
the top of $S$ is popped. As soon as $\pi_i$ creates a new vertex on the left of the current
leftmost vertex, $\pi_i$ is pushed onto $S$. These conditions are checked by comparing the
$\pi$-coordinates of intersections between $\pi_i$ and elements of the stack. The $\pi$-coordinates are
determined by the $\cap$ operator in the following way. When two curves intersect for $l$ in the
interval $-i_{\min}(u,v) \le l \le 3|V| - 3$, the operator $\cap$ returns the $\pi$-coordinate of the intersection.
By convention, if two curves intersect for $l$ outside the interval $-i_{\min}(u,v) \le l \le 3|V| - 3$,
or if $S_{top-1}(l)$ in line 9 is empty, then the operator $\cap$ returns the values $-\infty$ and $0$, respectively.
During the computation of $S$, we can keep track of the values of $l$ that correspond
to vertices of the region defined by $S$. Thus, the maximum in Step 5 of Algorithm PiFDR
can be found in $O(1)$ time, for every value of $l$, once $S$ and $S'$ are known.
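
As a point of comparison, the quantity that the two stacks encode can also be obtained by direct enumeration. The Python sketch below is a brute-force reference, not Algorithm OddCount itself: it evaluates the maximization of Step 5 for every admissible $l$ in $O(V^2)$ time per vertex pair instead of $O(V)$. The function name, the dictionary `dd_uv` mapping a latch count $i$ to $DD(u,v,i)$, and the parameter `n` standing for $|V|$ are assumptions of this example. The exponent of the duty-ratio is read here as $\rho^{(-1)^j}$, so that $j = 0$ reproduces the factor $\rho + 1$ of Lemma 50; that reading is also an assumption.

```python
def odd_count_bounds_bruteforce(dd_uv, i_min, n, rho, gamma0, gamma1, j):
    """For every l, return max over i (with i + l odd) of the Step 5 period bound.

    dd_uv maps a latch count i to DD(u, v, i); missing keys mean "no such path".
    """
    a = rho ** ((-1) ** j) + 1           # rho + 1 for j = 0, 1/rho + 1 for j = 1
    top = 3 * n - 3                      # largest latch count considered
    bounds = {}
    for l in range(-i_min, 3 * n):       # -i_min <= l <= 3n - 1
        best = None
        for i in range(i_min, top - l + 1):
            if (i + l) % 2 == 0 or i not in dd_uv:
                continue                 # keep only odd i + l with a known delay
            value = (a * dd_uv[i] + gamma0 + gamma1) / (a * (1 + i + l) / 2 + 1)
            if best is None or value > best:
                best = value
        if best is not None:
            bounds[l] = best
    return bounds
```

A reference of this kind is mainly useful for testing a stack-based implementation on small circuits, since both should report the same maximum for every $l$.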

The following lemma shows that Algorithm OddCount runs in linear time. 

Lemma 51 Algorithm OddCount terminates in $O(V)$ time.

Proof. The time required to compute and compare intersection points is $O(1)$, as is the
time required to push and pop $S$. Each curve $\pi_i$ is pushed onto $S$ or $S'$ at most once.
Curves popped from $S$ or $S'$ are discarded, so each curve $\pi_i$ is popped from $S$ or $S'$ at most
once. Thus, the total time needed to process each curve is $O(1)$, and since there are $O(V)$
curves, the total running time of Algorithm OddCount is $O(V)$. □

The following lemma is used to prove the correctness of Algorithm OddCount. 

Lemma 52 Consider the for loop in lines 7-11 of Algorithm OddCount. After each
iteration of this loop with an even value $i'$, the stack $S$ contains all curves $\pi_k$ such that for
every odd $l$ in the interval $-i_{\min}(u,v) \le l \le 3|V| - 1$ and for some even $k$ in the interval
$i' \le k \le 3|V| - 3$, we have

$$ \pi_k(l) = \max_i \pi_i(l) , $$

where $i$ ranges over all even values in the interval $i' \le i \le 3|V| - 3$.

Proof. The proof is by induction on the number of iterations. The base case for
$i' = 2\lfloor (3|V| - 3)/2 \rfloor$ holds, since $\pi_i(l) \ge 0$ for all $i$, $l$ in their corresponding intervals.

For the inductive step, we assume that the lemma holds for all values $i > i'$, and we
prove that it also holds for $i'$. Consider the curve $\pi_{i'}(l)$, and let $\pi_t(l)$ be the curve on the
top of $S$.

If $\pi_{i'}(-b) < \pi_t(-b)$, where $b = 2\lfloor i_{\min}(u,v)/2 \rfloor$, then the algorithm does not push $\pi_{i'}$
onto $S$. We will show that in this case we have $\pi_{i'}(l) < \pi_t(l)$ for all $l \ge -b$, and therefore
the lemma holds. Since $t > i'$, the definitions of $\pi_{i'}(l)$ and $\pi_t(l)$ and the inequality
$\pi_{i'}(-b) < \pi_t(-b)$ imply, after some algebraic manipulation, that

$$ -b > \frac{(\rho^{(-1)^j} + 1)\left(DD(u,v,t)\,i'/2 - DD(u,v,i')\,t/2\right) + (\gamma_0 + \gamma_1)(i'/2 - t/2)}{(\rho^{(-1)^j} + 1)\left(DD(u,v,i') - DD(u,v,t)\right)/2} - \frac{2}{\rho^{(-1)^j} + 1} - 1 . \tag{2.39} $$

The curves $\pi_{i'}(l)$ and $\pi_t(l)$, however, have a single intersection $(l_{i',t}, \pi_{i',t})$, where $l_{i',t}$ equals
the right-hand side of Inequality (2.39), and thus, we have $-b > l_{i',t}$. Since $\pi_{i'}(l)$ and
$\pi_t(l)$ are continuous for $l \ge -b$, we conclude from the inequality $\pi_{i'}(-b) < \pi_t(-b)$ that
$\pi_{i'}(l) < \pi_t(l)$ for all $l \ge -b$.

Now, let us consider the situation where $\pi_{i'}(-b) > \pi_t(-b)$. While the condition of the
while statement in line 9 of the algorithm holds, the stack $S$ is popped. We distinguish the
following two cases:

$l_{i',t} < -b$ or $l_{i',t} > 3|V| - 1$. In this case, we have $\pi_{i'}(l) > \pi_t(l)$ for all $l$ in the interval
$-b \le l \le 3|V| - 1$, by the continuity of $\pi_{i'}(l)$ and $\pi_t(l)$ (see Figure 2-22(a)). Therefore,
$\pi_t(l)$ is not a boundary curve when $\pi_{i'}(l)$ is also considered, and it can be discarded
without violating the lemma.

Figure 2-22: Parts (a) and (b) illustrate the situations in which the curve $\pi_{i'}(l)$ lies above the
curve $\pi_t(l)$ on the top of the stack $S$, resulting in popping $\pi_t(l)$ off of $S$. In part (a), the curves
$\pi_{i'}(l)$ and $\pi_t(l)$ intersect outside the interval $[-b, 3|V| - 3]$, whereas in part (b) they intersect inside
the interval. Part (c) illustrates the situation in which the curve $\pi_{i'}(l)$ is pushed onto the stack $S$.

$-b \le l_{i',t} \le 3|V| - 1$. In this case we have $\pi_{i'}(l) > \pi_t(l)$ for all $l$ in the interval $-b \le l < l_{i',t}$.
Since $\pi_t(l) > \pi_{t'}(l)$ for $-b \le l < l_{t,t'}$, where $l_{t,t'}$ is the $l$-coordinate of the intersection
between $\pi_t(l)$ and the curve $\pi_{t'}(l)$ below the top of the stack $S$ (see Figure 2-22(b)),
by the monotonicity and continuity of $\pi_t(l)$ and $\pi_{t'}(l)$ we have $l_{t,t'} < l_{i',t}$. Therefore,
$\pi_t(l)$ is not a boundary curve when $\pi_{i'}(l)$ is taken into consideration, and it can be
discarded without violating the lemma.

The curve $\pi_{i'}(l)$ is pushed onto the stack $S$ the first time the condition of the while statement
is not satisfied. In this case we have $\pi_{i'}(l) > \pi_t(l)$ for all $-b \le l < l_{i',t}$, and therefore
the lemma holds (see Figure 2-22(c)). □

Corollary 53 When Algorithm OddCount terminates, the stack $S$ contains all curves $\pi_k$
such that for every odd $l$ in the interval $-i_{\min}(u,v) \le l \le 3|V| - 1$ and for some even $k$ in
the interval $i_{\min}(u,v) \le k \le 3|V| - 3$, we have

$$ \pi_k(l) = \max_i \pi_i(l) , $$

where $i$ ranges over all even values in the interval $i_{\min}(u,v) \le i \le 3|V| - 3$. Similarly, the
stack $S'$ contains all curves $\pi_k$ such that for every even $l$ in the interval $-i_{\min}(u,v) \le l \le 3|V| - 1$
and for some odd $k$ in the interval $i_{\min}(u,v) \le k \le 3|V| - 3$, we have

$$ \pi_k(l) = \max_i \pi_i(l) , $$

where $i$ ranges over all odd values in the interval $i_{\min}(u,v) \le i \le 3|V| - 3$.

Proof. The proof for $S$ follows directly from Lemma 52 for $i' = i_{\min}(u,v)$. The proof for $S'$
is similar. □

Thus, we obtain the following theorem.

Theorem 54 In $O(V^2 E)$ time, Algorithm PiFDR correctly computes a set $\Pi(\rho)$ of $O(V^3)$
real numbers $\pi$ that includes the period of the optimal clocking scheme with duty-ratio $\rho$ that
can be achieved by a retiming of G.

Proof. The initialization is performed in $O(1)$ time. Step 2 performs a tramp-steamer
computation and terminates in $O(VE)$ time. Step 4 can be performed in $O(V^2 E)$ time.
According to Lemma 51, Step 5 terminates in $O(V^3)$ time, since there are $O(V^2)$ pairs
$u, v \in V$. Therefore, the total running time of Algorithm PiFDR is $O(V^2 E)$.

The correctness of the algorithm follows directly from Lemma 50 and Corollary 53. □

2.11 Multiphase clocking 

Many of the results for two-phase clocking can be generalized to multiphase clocking
disciplines. In this section, we sketch the formal framework for multiphase clocking and how the
various two-phase algorithms for timing verification and optimization can be generalized.
We derive a verification algorithm for k-phase clocking schemes that runs in $O(kVE)$ time on
a "simple" k-phase circuit with $|V|$ combinational elements and $|E|$ interconnections. Clock
tuning for simple k-phase circuits can be performed, but our best algorithms to date require
general linear programming. The algorithms for retiming to achieve a given clocking scheme
can be generalized to run in $O(VE + V^2 \lg V)$ time when the given clocking scheme is
symmetric, and in $O(kV^3)$ time when the given clocking scheme is not symmetric. The retiming
and symmetric tuning problem can be approximately solved in $O((VE + V^2 \lg V) \lg(kV/\epsilon))$
time, where $\epsilon$ is the relative error of the approximation. The retiming and fixed-duty-ratio
tuning problem can be approximately solved in $O(kV^3 \lg(kV/\epsilon))$ time. The approximation
scheme for the simultaneous clock tuning and retiming of two-phase circuits can also
be generalized, but the resulting polynomial running time is impractically large. Some of
the results in this section are similar to results obtained independently by Lockyear and
Ebeling [32].

We define a k-phase clocking scheme to be a $2k$-tuple $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1, \ldots, \phi_{k-1}, \gamma_{k-1} \rangle$
of real numbers that satisfies the following constraints:

CS1. $\pi = \sum_{i=0}^{k-1} (\phi_i + \gamma_i)$;

CS2. $0 \le \phi_i \le \pi$ for $i = 0, 1, \ldots, k-1$;

CS3. $\phi_i + \gamma_i \ge 0$ for $i = 0, 1, \ldots, k-1$;

CS4. $\gamma_i + \phi_{i+1} \ge 0$ for $i = 0, 1, \ldots, k-1$;

CS5. $\phi_i + \gamma_i + \phi_{i+1} \le \pi$, for $i = 0, 1, \ldots, k-1$;

where we assume here and henceforth that addition in subscripts is performed modulo $k$.
In this formulation, we allow overlapping phases, that is, the $\gamma_i$ "gaps" can be negative.
Condition CS3 says that phases rise in order from $0$ to $k-1$, and Condition CS4 says that
they fall in order. For the simple k-phase circuits we consider, Conditions CS3 and CS4
are not both strictly necessary. One can show that if phases rise in one order, there is no
advantage to having them fall in another order. Condition CS5 guarantees that at least one
of the $k$ phases is down at any point in time; for two-phase circuitry, this constraint reduces
to $\gamma_i \ge 0$ for $i = 0, 1$. We say a simple k-phase clocking scheme is symmetric if $\phi_i = \phi$ and
$\gamma_i = \gamma$ for some constants $\phi$ and $\gamma$ and all $i = 0, 1, \ldots, k-1$.
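
Conditions CS1 through CS5 translate directly into a small validity check. The Python sketch below is illustrative only: the function name, the list-based representation of the phase widths and gaps, and the floating-point tolerance `tol` are assumptions of this example, and the inequalities are taken to be non-strict.

```python
def is_valid_clocking_scheme(phi, gamma, pi, tol=1e-9):
    """Check CS1-CS5 for phase widths phi[0..k-1], gaps gamma[0..k-1], period pi."""
    k = len(phi)
    if k == 0 or len(gamma) != k:
        return False
    # CS1: the period is the sum of all phase widths and gaps.
    if abs(sum(phi) + sum(gamma) - pi) > tol:
        return False
    for i in range(k):
        nxt = (i + 1) % k                      # subscripts are taken modulo k
        if not (-tol <= phi[i] <= pi + tol):   # CS2
            return False
        if phi[i] + gamma[i] < -tol:           # CS3: phases rise in order
            return False
        if gamma[i] + phi[nxt] < -tol:         # CS4: phases fall in order
            return False
        if phi[i] + gamma[i] + phi[nxt] > pi + tol:   # CS5
            return False
    return True
```

For example, a two-phase scheme with $\phi_0 = \phi_1 = 3$, $\gamma_0 = \gamma_1 = 1$, and $\pi = 8$ passes the check, whereas replacing the gaps by $\gamma_0 = -1$, $\gamma_1 = 3$ violates CS5, in agreement with the remark that CS5 forces nonnegative gaps in the two-phase case.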
A simple k-phase circuit has the following properties:

MC1. It employs a simple k-phase clocking scheme.

MC2. It contains no latch-free cycles.

MC3. Any purely combinational path that begins with a latch controlled by some $\phi_i$ ends
with a latch controlled by $\phi_{i+1}$.

We denote such a simple k-phase circuit by $G = (V, E, d, w, \chi)$, where $\chi$ now maps each
vertex to a number in $\{0, 1, \ldots, k-1\}$ corresponding to the input phase of the vertex.
Condition MC3 ensures that the latches on any path are clocked in order by the phases in
the clocking scheme.

We now summarize the generalizations of our algorithms for two-phase circuits to simple
k-phase circuits.

Clock period constraints. Inequalities (2.2) and (2.3), which provide necessary and
sufficient conditions for the proper timing of two-phase circuits, generalize straightforwardly
to multiphase circuits. Let $G = (V, E, d, w, \chi)$ be a simple k-phase circuit that employs a
clocking scheme $\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1, \ldots, \phi_{k-1}, \gamma_{k-1} \rangle$. Then G is properly timed if and only if,
for every path $u \stackrel{p}{\leadsto} v$ in G,

$$ d(p) \le \pi \left\lfloor \frac{w(p) + 1}{k} \right\rfloor + \psi(\chi(u), \chi(v)) , \tag{2.40} $$

where $\psi(i, j) = \phi_i + \gamma_i + \phi_{i+1} + \cdots + \gamma_j + \phi_{j+1}$.

Timing verification. To check whether a simple multiphase circuit is properly timed by
a simple k-phase clocking scheme, we can use the constraints defined by Inequality (2.40) to
derive a set of cyclic constraints identical to those described by Inequality (2.8), except with
$k$ in the denominator instead of 2. These cyclic constraints can be checked in $O(VE)$ time
as in Step 4 of Algorithm Verify. If the cyclic constraints are met, the path constraints
(2.40) need only be satisfied by simple paths. The $D(v,i)$ values in Algorithm Verify must
now be computed for $i = 0, 1, \ldots, (2k-1)|V|$, and the entire verification algorithm runs in
$O(kVE)$ time.

Sensitivity analysis. The sensitivity analysis algorithms that we presented for two-phase
circuitry can be extended to k-phase clocking schemes. For a single gate, noncritical
sensitivity analysis can be performed in $O(kVE)$ time, since the $D(v,i)$ and $D'(v,i)$ values must
now be computed for $i = 0, 1, \ldots, (2k-1)|V|$. For all gates, noncritical sensitivity analysis
can still be performed in $O(VE + V^2 \lg V)$ time with appropriate choice of the edge-lengths
for the all-pairs shortest-paths algorithm. Critical sensitivity analysis can be performed in
$O(kVE)$ time for a single gate.

Clock tuning. Clock tuning for simple multiphase circuits can be performed in much
the same way as for two-phase circuits, but the problem no longer succumbs to a simple,
two-dimensional linear program. Linear programming still suffices to solve the problem,
however. Each $\phi_i$ becomes a variable in a linear program, which can be solved with standard
techniques [41]. If the number of phases is assumed to be fixed, then the problem can
be solved in $O(kV^2)$ steps, that is, in a number of steps proportional to the number of
constraints for proper timing [36]. Certain special cases can be handled without resorting to
general linear programming. For example, a circuit with a three-phase, nonoverlapping clock
can be tuned in $O(VE)$ time using the three-dimensional linear programming algorithm of
Megiddo [37].



Retiming with symmetric clocking schemes. When clock phases are symmetric,
simple k-phase circuits can be retimed to achieve a given clock period in $O(VE + V^2 \lg V)$
time, independent of $k$. The constraints that must be solved are a natural analog of the
two-phase constraints described by Inequalities (2.21), (2.22), (2.23), and (2.24):

$$ \begin{aligned}
r(u) - r(v) &\le w(e) && \text{for all } u \stackrel{e}{\to} v \in E \\
R(v) - r(v) &\le 0 && \text{for all } v \in V \\
R(u) - R(v) &\le w(e) - \tfrac{k}{\pi}\, d(v) && \text{for all } u \stackrel{e}{\to} v \in E \\
r(v) - R(v) &\le \tfrac{k}{\pi}\,(\phi - d(v)) + 1 && \text{for all } v \in V .
\end{aligned} $$

As with the two-phase constraints, these constraints can be solved using the algorithm
MILP from [30].
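
To make the role of the two sets of variables concrete, the following Python sketch checks whether a candidate pair of assignments satisfies the four constraint families. It does not solve the mixed-integer program and is not the MILP algorithm of [30]. The function name and data layout are hypothetical, and the coefficient multiplying the delays is assumed here to be $k/\pi$, the natural analog of the two-phase coefficient.

```python
def satisfies_symmetric_constraints(vertices, edges, delay, r, R, k, pi, phi, tol=1e-9):
    """Check the four constraint families for integer r[v] and real R[v].

    edges is a list of (u, v, w) triples; delay[v] is d(v); phi is the common
    phase width of the symmetric k-phase scheme with period pi.
    The coefficient k / pi is an assumption of this sketch.
    """
    scale = k / pi
    for u, v, w in edges:
        if r[u] - r[v] > w + tol:                        # latch counts stay nonnegative
            return False
        if R[u] - R[v] > w - scale * delay[v] + tol:
            return False
    for v in vertices:
        if R[v] - r[v] > tol:
            return False
        if r[v] - R[v] > scale * (phi - delay[v]) + 1 + tol:
            return False
    return True
```

A checker like this is convenient for validating the output of a solver on small examples before trusting it on larger circuits.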



Retiming (with arbitrary clocking schemes). Even when clock phases are not symmetric,
we can retime to achieve a given simple k-phase clocking scheme in $O(kV^3)$ time.
For simple multiphase circuits, inequalities analogous to Inequalities (2.29) and (2.30) can
be formulated as simple summations. In particular, Inequality (2.40) can be rewritten as



2^ (7i + <Pr+l) ~ (7x(«) + < Px(^)+i) + n 



=xM 



k 



+ ^(x(u),x(v)) -d{jp) 



X(u)-l+r(u) 

E 

«=x(«)-i 



> J2 (& + 70 - (<k(«)-i + 7x(«)-i) ( 2 - 41 ) 



for any retimed circuit. These constraints can be solved using integer monotonic
programming. By maintaining the two summations dynamically, the claimed running time of
$O(kV^3)$ can be obtained, where the additional factor of $k$ stems from the fact that some
$r(v)$ can now be as high as $O(kV)$.



Retiming for minimum latch count. For k-phase, symmetric clocking schemes, we
can prove that this problem amounts to computing an assignment $r : V \to \mathbf{Z}$ such that

$$ \sum_{v \in V} \left( \mathrm{indegree}(v) - \mathrm{outdegree}(v) \right) r(v) $$

is minimized, subject to

$$ r(u) - r(v) \le w(e) $$

for every edge $u \stackrel{e}{\to} v \in E$, and

$$ r(u) - r(v) \le w(p) - \frac{k}{\pi}\,(d(p) - \phi) + 1 $$

for every path $u \stackrel{p}{\leadsto} v$ in G. This problem is still the dual of an uncapacitated minimum-cost
flow problem, and it can be solved in $O(V^3 \lg V)$ time.
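
The objective function above measures the change in the total latch count caused by a retiming, because each edge $u \to v$ gains $r(v) - r(u)$ latches. The short Python sketch below illustrates this identity on a small example. The function names and the edge-list layout are assumptions of the example.

```python
def latch_count_objective(vertices, edges, r):
    """Sum over vertices of (indegree - outdegree) * r(v)."""
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for u, v, _w in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return sum((indeg[v] - outdeg[v]) * r[v] for v in vertices)

def retimed_edges(edges, r):
    """Apply the retiming r: edge u -> v gets w(e) + r(v) - r(u) latches."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

vertices = ["a", "b", "c"]
edges = [("a", "b", 1), ("a", "c", 0), ("b", "c", 1), ("c", "a", 2)]
r = {"a": 0, "b": 0, "c": 1}

before = sum(w for _u, _v, w in edges)
after = sum(w for _u, _v, w in retimed_edges(edges, r))
change = latch_count_objective(vertices, edges, r)
```

In this example the retimed circuit has one more latch than the original, and the objective evaluates to the same difference, so minimizing the objective minimizes the latch count of the retimed circuit.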

Retiming for minimum clock period. The algorithms for retiming a simple k-phase
circuit to achieve a given clocking scheme can be used to obtain fully polynomial-time
approximation schemes for several problems related to retiming for minimum clock period.
Specifically, using binary search in the clock period domain as described in Section 2.9,
we can solve the retiming and symmetric tuning problem in $O((VE + V^2 \lg V) \lg(kV/\epsilon))$
time with relative error $\epsilon$. Similarly, we can solve the retiming and fixed-duty-ratio tuning
problem in $O(kV^3 \lg(kV/\epsilon))$ time. A straightforward generalization of the linear search
scheme in Section 2.9 yields a fully polynomial-time approximation scheme for the general
retiming and tuning problem for simple k-phase circuits. The running time of this algorithm
contains a factor $\epsilon^{-k}$, however, which can be prohibitively large for a small relative error $\epsilon$.
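
The binary search in the clock period domain can be sketched generically as follows. This Python fragment is a minimal illustration and not the procedure of Section 2.9: the callback `is_feasible`, standing for a retiming algorithm that decides whether a given period is achievable, is hypothetical, and the sketch assumes that feasibility is monotone in the period and that a positive lower bound `low` and a feasible upper bound `high` are known.

```python
def min_period_binary_search(is_feasible, low, high, eps):
    """Return a feasible period within relative error eps of the minimum.

    Assumes 0 < low <= optimum <= high, is_feasible(high) is True, and
    is_feasible is monotone (every period above a feasible one is feasible).
    """
    if low <= 0 or eps <= 0:
        raise ValueError("low and eps must be positive")
    if not is_feasible(high):
        raise ValueError("upper bound must be feasible")
    while high - low > eps * low:
        mid = (low + high) / 2
        if is_feasible(mid):
            high = mid      # the optimum is at most mid
        else:
            low = mid       # the optimum is larger than mid
    return high
```

Each iteration halves the search interval, so the number of feasibility tests grows logarithmically in the ratio of the initial interval to the target accuracy, which is the source of the logarithmic factor in the running times quoted above.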

2.12 Conclusion 

Our algorithms for verifying and optimizing level-clocked circuits have their limitations. 
For example, as is the case with most work on retiming, it is hard to incorporate data- 
dependent propagation delays in the framework without making most of the interesting 
questions NP-hard. Several issues, however, are amenable to efficient algorithmic solutions. 
We address some of these issues in this section. We first discuss how to cope in our ideal 
model with nonideal clocking waveforms. We then show how to adjust our approximation 
algorithms in order to compute exact solutions to the various retiming and tuning problems. 
We move on to describe the incorporation in our model of precharged gates, nonuniform 
propagation delays, and nonzero minimum propagation delays. We conclude by discussing 
generalizations of our algorithms to handle gated clocks and nonsimple multiphase clocking 
disciplines. 

A phenomenon that arises in practice is that clock phases are never truly square waves; 
the waveform rises or falls over an interval of time. To cope with this effect in our ideal 
model, one can ensure that the gaps between clock phases are sufficiently large that consec- 
utive latches clocked on opposite phases are not high simultaneously. Similarly, the effect 
of clock skew can be handled by adjusting the gap widths. 

The fully polynomial-time approximation schemes in Section 2.9 can be adjusted to find
an exact solution when the propagation delays of the combinational elements are integers.
Specifically, the retiming and symmetric tuning problem can be solved exactly in
$O((VE + V^2 \lg V) \lg(V d_{\max}/\delta))$ time, where

$$ \delta = \min \left\{ \frac{1}{3|V| - 1}, \; \frac{\gamma_0}{2(3|V| - 1)^2}, \; \frac{\gamma_1}{2(3|V| - 1)^2} \right\} . $$

Similarly, the retiming and fixed-duty-ratio problem can be solved exactly in $O(V^3 \lg(V d_{\max}/\delta))$
time.

An important extension of our work is the incorporation of precharged gates, nonuniform 
propagation delays, and nonzero minimum propagation delays to our circuit model. With 
precharged gates, retiming and clock tuning still have efficient algorithmic solutions, but 
there are many subtleties that arise in the formulation of the constraints. The incorpora- 
tion of functional elements where the propagation delay may differ for different input-output 
pairs (the "nonuniform propagation delay" model from [31]) changes the time complexities 
of our algorithms, but the essential algorithms remain unchanged. When minimum prop- 
agation delays (sometimes called "contamination" delays) are incorporated in the model, 
the output of a functional element does not become invalid until some specified minimum 
amount of time after an input changes. A polynomial-time algorithm for the timing veri- 
fication problem when minimum propagation delays are included in the circuit model has 
appeared in [56]. A polynomial-time algorithm for the clock tuning problem with minimum 
propagation delays has appeared in [53]. We believe that many of our optimization algo- 
rithms can be generalized to handle such circuits in polynomial time. This is a topic of 
current research. 

Two generalizations of our work which seem more problematic are the handling of gated 
clocks and nonsimple multiphase clocking disciplines. When logic circuits involve the clock 
signals themselves (so-called gated clocks), it is possible to show, in the general case, that 
many timing verification and optimization problems are NP-hard. Nevertheless, we suspect 
that by making conservative assumptions regarding the behavior of such circuits, many of 
these problems become tractable. Nonsimple multiphase circuits also exhibit many sub- 
tleties. A path in such circuits may pass through latches in an arbitrary order, rather than 
the canonical order assumed in a simple multiphase circuit. Though the proper timing of 
such circuits can be verified using the analysis and algorithms from [20], the timing opti- 
mization of such circuits is possibly more complex. Whether these problems have efficient 
solutions is a topic for further research. 



Chapter 3 

Tim: A Timing Package for 
Level-Clocked Circuitry 



3.1 Introduction 

In this chapter we describe Tim, a new timing package we have developed for circuitry that 
employs a nonoverlapping two-phase clocking scheme. Tim's features include optimizations 
for retiming and sensitivity analysis as well as more conventional operations such as veri- 
fication of proper timing and optimal tuning of clocking schemes. The entire package has 
been written using the C programming language, and it has been integrated into the SIS 
tools from Berkeley. Copies of the software have been available over the Internet since June 
1993 and can be obtained by sending a request to marios@lcs.mit.edu. 

Several tools have been developed for analyzing the timing of circuitry that contains 
level-clocked latches [1, 4, 6, 23, 40, 51, 54]. These tools perform timing verification and 
enable the user to minimize the overall clock period by tuning various parameters of the 
clocking schemes. Our tool provides the designer with two additional features: retiming and 
sensitivity analysis. Moreover, the algorithms in Tim are based on the algorithms described 
in Chapter 2, and thus they are provably correct and run in polynomial time. 

Tim has been applied on a variety of circuits that have been obtained from academic and 
industrial sources. The results from the application of our tool are presented in Chapter 4. 

This chapter is organized as follows. Section 3.2 gives an overview of the system, and 
Section 3.3 presents Tim's circuit model which extends the ideal model that we assumed in 
Chapter 2. Section 3.4 describes the algorithms that we implemented in Tim and the new 
constraints for the extended model. Section 3.5 concludes this chapter by discussing the 
performance of our implementation. 

3.2 System Overview

Tim provides the following classes of operations for circuitry that employs nonoverlapping 
two-phase clocking schemes: 

Timing verification: Given a two-phase circuit and a clocking scheme, Tim verifies 
whether the circuit is properly timed. 

Sensitivity analysis: Given a two-phase circuit that is properly timed by a given clocking
scheme, Tim can identify for each gate the maximum increase in its propagation delay that
will not affect proper timing of the circuit by the given clocking scheme. Moreover, given a
feasible clocking scheme $\pi$, Tim can identify for each critical gate in the circuit the minimum
decrease in its propagation delay that will remove that gate from the critical path.

Clock tuning: Given a two-phase circuit, Tim computes a nonoverlapping two-phase 
clocking scheme so that the given circuit operates at maximum speed. 

Retiming for speed: Given a two-phase circuit and a clocking scheme, Tim computes a 
retiming of the given circuit that is properly timed by the given clocking scheme. When the 
circuit is not bound to a specific clocking scheme, Tim can perform retiming in conjunction 
with simultaneous clock tuning, so that the resulting circuit operates at maximum speed. 
Tim is able to perform retiming and clock tuning optimally both for clocking schemes with 
fixed duty-ratio and for clocking schemes with unrestricted duty-ratio. 

Retiming for minimum latch count: Given a two-phase circuit and a symmetric clock- 
ing scheme, Tim can compute a retimed circuit with minimum latch count that is properly 
timed by the given clocking scheme. 

3.3 Circuit Model 

Tim manipulates circuits that employ a nonoverlapping two-phase clocking scheme
$\pi = \langle \phi_0, \gamma_0, \phi_1, \gamma_1 \rangle$ with clock period $\pi = \phi_0 + \gamma_0 + \phi_1 + \gamma_1$. The gaps $\gamma_0$ and $\gamma_1$ between the two
phases handle engineering considerations such as setup and hold times of the latches, clock
skew, and nonideal clocking waveforms.

The circuit model employed by Tim is based on the one described in Chapter 2. 
A two-phase, level-clocked circuit is modeled as a vertex-weighted, edge-weighted graph
$G = \langle V, E, d, w, \chi \rangle$. The vertices in $V$ represent blocks of combinational logic, and the
edges in $E$ represent wires. The nonnegative, real vertex-weight $d(v)$ denotes the maximum
propagation delay of the signals through the block represented by $v$. The minimum propagation
delay (contamination delay) of every block is assumed to be zero. The nonnegative,
integer edge-weight $w(e)$ denotes the number of latches on the wire represented by $e$. Each
latch is clocked by one of the two phases of $\pi$. Whenever the clock input of a latch is
asserted, the latch becomes transparent and data ripple through. Along any path in $G$,
latches are clocked on alternate phases, and around any directed cycle in $G$ there are at
least two latches. The function $\chi : V \to \{0, 1\}$ denotes for each vertex $v$ the phase that
clocks all latches that can reach $v$ along a purely combinational path.

In Tim we have extended our ideal model from Chapter 2 to include non-ideal latches.
All latches in Tim are assumed to have equal propagation delays, equal setup times, and
equal hold times. (The equality restriction is not necessary for the algorithms that perform
timing verification, clock tuning, and sensitivity analysis. The proofs for the correctness and
the running time of the retiming algorithms, however, rely on the assumption that all latches
have identical delay characteristics.) Each latch has two delay parameters associated with
it. The first parameter, which we denote by $d_{out}$, gives the time between the rising edge of
the phase that clocks the latch and the moment that data appear at the latch output. The
second parameter, which we denote by $d_{thru}$, gives the propagation delay of data through
the latch when the clocking phase is already high at the time that data arrive at the latch
input. Setup and hold times are embedded in the gaps $\gamma_0$ and $\gamma_1$ of the clocking scheme $\pi$.
Given an intended clocking scheme $\pi' = \langle \phi'_0, \gamma'_0, \phi'_1, \gamma'_1 \rangle$, our algorithms operate on the ideal
model assuming a clocking scheme $\pi = \langle \phi'_0 - S, \gamma'_0 + S, \phi'_1 - S, \gamma'_1 + S \rangle$, where $S$ is the setup
time of a latch. In order to avoid data contamination due to zero minimum propagation
delays, the gaps $\gamma'_0$ and $\gamma'_1$ are required to exceed the hold time $H$ of a latch.

Another phenomenon that arises in practice is that clock phases are never truly square 
waves; the waveform rises or falls over an interval of time. To cope with this effect in Tim, 
one can ensure that the gaps between clock phases are sufficiently large that consecutive 
latches clocked on opposite phases are not high simultaneously. Similarly, the effect of clock 
skew can be handled by adjusting the gap widths. 
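
The setup-time adjustment of the clocking scheme described in this section can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: Tim itself is written in C, and the names TwoPhaseScheme and to_ideal_scheme are hypothetical and do not come from the package. The sketch shortens each phase by the setup time S, widens each gap by S, and rejects intended schemes whose gaps do not exceed the hold time H.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoPhaseScheme:
    """A nonoverlapping two-phase scheme (phi0, gamma0, phi1, gamma1)."""
    phi0: float
    gamma0: float
    phi1: float
    gamma1: float

    @property
    def period(self) -> float:
        # Clock period: the sum of both phases and both gaps.
        return self.phi0 + self.gamma0 + self.phi1 + self.gamma1


def to_ideal_scheme(intended: TwoPhaseScheme, setup: float, hold: float) -> TwoPhaseScheme:
    """Map an intended scheme to the ideal-model scheme (hypothetical helper)."""
    # Both gaps of the intended scheme must exceed the hold time.
    if intended.gamma0 <= hold or intended.gamma1 <= hold:
        raise ValueError("each gap must exceed the latch hold time")
    # Shorten each phase by the setup time and widen each gap by the same amount.
    return TwoPhaseScheme(
        intended.phi0 - setup,
        intended.gamma0 + setup,
        intended.phi1 - setup,
        intended.gamma1 + setup,
    )


intended = TwoPhaseScheme(10, 2, 10, 2)
ideal = to_ideal_scheme(intended, setup=1, hold=1)
```

Note that the adjustment leaves the clock period unchanged, since the time removed from each phase is added to the gap that follows it.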

3.4 System Operation 

In this section we describe the algorithms implemented in Tim. We first describe the 
shortest-paths algorithms that we implemented. We then describe the operations of our 
tool. For each of these operations, the user may use a technology library to determine the
propagation delays of the gates or may specify all gates to have unit propagation delays.

Shortest-paths computations are repeatedly employed in almost all functions of Tim.
We have implemented two single-source shortest-paths algorithms. The first is an algorithm
by Bellman and Ford that runs in $O(VE)$ time and solves the shortest-paths problem on
graphs with real edge-weights [5]. We use this algorithm before all-pairs shortest-paths
computations in order to compute graphs with identical shortest paths and nonnegative
edge-weights, because shortest paths can be computed more efficiently on such graphs.

The second shortest-paths algorithm that we have implemented is Dijkstra's algorithm
for graphs with nonnegative edge-weights [5]. The running time of this algorithm depends
on the implementation of the priority queue it employs. We tried three priority queues. The
first one was a simple array, in which case the theoretical running time of the algorithm is
$O(V^2)$. We used this implementation initially, while we were debugging our programs. The
array soon became a limiting factor in the performance of our tool, and so we implemented two
additional data structures: a Fibonacci heap [11] that yields an $O(E + V \lg V)$ running time,
and a binary heap that yields an $O(E \lg V)$ running time. Our binary heap implementation
was faster than the Fibonacci heap implementation, most likely because circuit graphs are
sparse and the overhead of a Fibonacci heap is too high.
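
The combination described above, one Bellman-Ford pass that reweights the graph so that all edge-weights become nonnegative, followed by Dijkstra's algorithm with a binary heap from every source, can be sketched as follows. This is an illustrative Python sketch and not Tim's C code; the function names are hypothetical. Python's heapq module stands in for the binary heap, and stale heap entries are skipped on removal instead of being updated in place.

```python
import heapq

INF = float("inf")


def bellman_ford_potentials(n, edges):
    """Return potentials h such that w + h[u] - h[v] >= 0 for every edge (u, v, w).

    Starting from h = 0 everywhere is equivalent to adding a virtual source with
    zero-weight edges to all vertices. Raises ValueError on a negative cycle.
    """
    h = [0] * n
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if h[u] + w < h[v]:
                h[v] = h[u] + w
                changed = True
        if not changed:
            return h
    raise ValueError("graph contains a negative-weight cycle")


def dijkstra(n, adj, source):
    """Single-source shortest paths on nonnegative weights using a binary heap."""
    dist = [INF] * n
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry; a shorter path to u was already found
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


def all_pairs_shortest_paths(n, edges):
    """All-pairs shortest paths: reweight once, then run Dijkstra from each vertex."""
    h = bellman_ford_potentials(n, edges)
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w + h[u] - h[v]))  # reweighted edge is nonnegative
    result = []
    for s in range(n):
        dist = dijkstra(n, adj, s)
        # Undo the reweighting to recover distances in the original graph.
        result.append([d - h[s] + h[t] if d < INF else INF for t, d in enumerate(dist)])
    return result


example = all_pairs_shortest_paths(3, [(0, 1, 2), (1, 2, -1), (0, 2, 4), (2, 0, 3)])
```

With V sources and a binary heap, the Dijkstra phase of this scheme takes O(VE lg V) time, which is consistent with the bounds quoted for the operations of Tim that rely on all-pairs computations.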

3.4.1 Verification of Proper Timing

The timing verification operation in Tim is a direct implementation of Algorithm TV that is
described in Section 2.3 and runs in $O(VE)$ time. In our implementation, we have adapted
Inequalities (2.11), (2.12), and (2.8) to take into account the propagation delays of the
latches. Specifically, for $v \in V$ and $i = 0, 1, \ldots, 3|V| - 3$, we check the constraints

$$d_{out} + i \cdot d_{thru} + D(v, i) \le \pi \left( \frac{i+1}{2} \right) + \phi_{1-\chi(v)}$$

if $i$ is odd, and

$$d_{out} + i \cdot d_{thru} + D(v, i) \le \pi \left( \frac{i+2}{2} \right) - \gamma_{1-\chi(v)}$$

if $i$ is even; and for every simple cycle $c$,

$$d(c) + w(c) \cdot d_{thru} \le \pi \left( \frac{w(c)}{2} \right).$$
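
The per-path check of this section can be sketched in Python as follows. The sketch assumes that the right-hand sides are $\pi(i+1)/2 + \phi_{1-\chi(v)}$ for odd $i$ and $\pi(i+2)/2 - \gamma_{1-\chi(v)}$ for even $i$; this form is inferred from the symmetric-scheme constraint of Section 3.4.5 and should be treated as an assumption. The function names are hypothetical, and the value D(v, i) is taken as an input instead of being computed.

```python
def arrival_bound(i, chi_v, phi, gamma):
    """Latest allowed arrival time for a path through i latches ending at v.

    phi and gamma are pairs (phi[0], phi[1]) and (gamma[0], gamma[1]);
    chi_v is the phase chi(v) in {0, 1}. The bound is measured from the
    rising edge of the phase that launches the data (assumed form).
    """
    period = phi[0] + gamma[0] + phi[1] + gamma[1]
    other = 1 - chi_v
    if i % 2 == 1:
        # (i + 1) / 2 whole periods, then the end of the high phase.
        return period * ((i + 1) // 2) + phi[other]
    # (i + 2) / 2 whole periods, minus the gap that follows the phase.
    return period * ((i + 2) // 2) - gamma[other]


def path_is_properly_timed(i, chi_v, max_delay, d_out, d_thru, phi, gamma):
    """Check d_out + i * d_thru + D(v, i) against the bound (hypothetical helper)."""
    return d_out + i * d_thru + max_delay <= arrival_bound(i, chi_v, phi, gamma)


# Asymmetric scheme with phases 4 and 6 and gaps 1 and 2 (period 13).
bound_even = arrival_bound(0, 0, (4, 6), (1, 2))
bound_odd = arrival_bound(1, 0, (4, 6), (1, 2))
```

For a symmetric scheme both cases reduce to the same expression, $\pi i / 2 + \pi - \gamma$, which is the bound that appears in the latch-count constraints of Section 3.4.5.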



3.4.2 Sensitivity Analysis 

The current version of Tim performs both noncritical and critical sensitivity analysis. 

The noncritical sensitivity analysis is performed for all gates in the circuit, and it is
a direct implementation of Algorithm AllNCSA. The asymptotic running time of our
implementation, however, is $O(VE \lg V)$, because we employ a binary heap as a priority
queue.

The critical sensitivity analysis in Tim is not a direct implementation of Algorithm
CSA. Given a circuit $G$ and a feasible clocking scheme $\pi$, Tim runs Algorithm AllNCSA
to identify the critical gates in the circuit, that is, the gates with zero slack. Then, for each
critical gate $u$, Tim sets $d(u) = 0$ and runs Algorithm NCSA$(G, \pi', u)$, where the clocking
scheme $\pi'$ is computed by tuning the clock of $G$ when $d(u) = 0$. The user can restrict the
clocking scheme $\pi'$ to have the same duty-ratio as the clocking scheme $\pi$. Alternatively, the
user may choose a $\pi'$ that is computed by clock tuning over all possible duty-ratios. Our
implementation of Algorithm NCSA is based on the $O(VE)$-time shortest-paths algorithm
by Bellman and Ford. Thus, the critical sensitivity analysis operation in Tim runs in a total
of $O(VE \lg V + cVE)$ steps, where $c$ denotes the number of critical gates for the clocking
scheme $\pi$.



3.4.3 Clock Tuning 

Tim can perform clock tuning with either fixed or unconstrained duty-ratio. Our implementation
is based directly on the algorithm described in Section 2.5 and terminates in
$O(VE)$ time. The clock tuning algorithm in Tim computes the $O(V^2)$ values $D(v, i)$ and
then solves a linear program in two dimensions in order to determine the optimal point of
the ensuing constraints.






3.4.4 Retiming for Speed 

Included among Tim's features are algorithms for the retiming problem and fully polynomial
approximation algorithms for the various retiming and tuning problems. The basic subroutine
of all these algorithms is Algorithm Retime, which operates on the constraints (2.26),
(2.27) and (2.28). In our implementation, we have adapted these constraints to account for
the propagation delays of the latches. Specifically, Tim computes a retiming $r$ such that for
every edge $e = u \to v \in E$, we have

$$r(u) - r(v) \le w(e),$$

and for every path $p : u \leadsto v$, we have

$$d_{out} + (w(p) + r(v) - r(u)) \cdot d_{thru} + d(p) \le \pi \left( \cdots \right) + \phi_{\chi(u)} - \pi + (r(v) \bmod 2)\,(\gamma_{\chi(u)} + \phi_{1-\chi(u)}) - (r(u) \bmod 2)\,(\phi_{\chi(u)} + \gamma_{\chi(u)})$$

if $\chi(u) \ne \chi(v)$, and

$$d_{out} + (w(p) + r(v) - r(u)) \cdot d_{thru} + d(p) \le \pi \left( \cdots \right) - \gamma_{1-\chi(u)} + (r(v) \bmod 2)\,(\gamma_{1-\chi(u)} + \phi_{\chi(u)}) - (r(u) \bmod 2)\,(\phi_{\chi(u)} + \gamma_{\chi(u)})$$

if $\chi(u) = \chi(v)$. It is straightforward to verify that these constraints can be brought into
the form $f(r(v)) \ge g(r(u))$, where $f$ and $g$ are monotonic functions. Using a binary heap
as a priority queue, Tim generates the retiming constraints in $O(VE \lg V)$ time by an all-pairs
shortest-paths computation. Therefore, our implementation of Algorithm Retime
terminates in $O(V^3 + VE \lg V)$ steps.

Retiming in Tim does not relocate I/O latches. This constraint can be easily enforced
by introducing the pair of inequalities

$$r(u) - r(v) \ge 0$$
$$r(u) - r(v) \le 0$$

for every pair of vertices $u, v \in V$ such that there is an edge $e = u \to v$ with an I/O latch on the
wire $e$.



3.4.5 Retiming for Minimum Latch Count 

Tim's retiming algorithm for minimum latch count is an implementation of Orlin's capacity
scaling algorithm for computing minimum-cost flows [38], and it terminates in $O(V^3)$ steps.
Our implementation works on a modified circuit graph that takes into account latch sharing.
The construction of this graph is identical to the construction presented in [31] for retiming
of edge-triggered circuitry. Tim takes into account the propagation delays of the latches by
finding a solution to the constraints

$$r(u) - r(v) \le w(e)$$

for every edge $e = u \to v$ in $G$, and

$$d_{out} + (w(p) + r(v) - r(u)) \cdot d_{thru} + d(p) \le \pi (w(p) + r(v) - r(u))/2 + \pi - \gamma$$

for every path $p : u \leadsto v$ in $G$. It is straightforward to verify that these constraints can be
rewritten in a form that is the dual of an uncapacitated minimum-cost flow problem. As
with retiming for speed, retiming for minimum latch count does not relocate I/O latches.

3.5 System Performance 

Our primary concern during Tim's implementation was to achieve correct functionality. 
Speed, although important, was a secondary concern. Nevertheless, Tim operates reasonably
fast on reasonably large inputs. We have used the current version of Tim extensively
on a Sun SPARCstation 2 with 64MB of main memory. For circuits with 1,500 gates after
technology mapping, the timing analysis and clock tuning operations typically require a
couple of minutes. The retiming operations, however, require approximately 35 minutes.
For large inputs, our tool is almost always slowed down because of paging. The adverse
effect of paging on the running time of our algorithms is particularly evident when retiming,
due to the $O(V^2)$ space requirements of the retiming algorithms.

There are several straightforward ways to speed up Tim's retiming operations. The 
practical efficiency of the latch count minimization algorithm will possibly improve by using 
a simplex-based algorithm to solve the minimum-cost flow problem. Even though simplex- 
based algorithms for minimum-cost flow are not guaranteed to run in polynomial time, they 
have been shown to perform particularly well in practice [15]. 

Another operation in Tim with potential for easy improvement is retiming with sym- 
metric clocking schemes. The current implementation of Tim employs the general retiming 
algorithm Retime even when the clocking schemes are symmetric. We believe that the 
practical efficiency of this operation can be substantially improved by implementing Algo- 
rithm RwSCS that runs on a sparse graph representation of the problem. 



Chapter 4 

Edge- Triggering vs. Level-Clocking 



4.1 Introduction 

Level-clocking is becoming an increasingly popular alternative to edge-triggering as a clock- 
ing methodology for high-performance designs. Proponents of level-clocking argue that 
level-clocked circuitry can provide more flexibility in meeting a specific clock period and 
that it has the theoretical potential to operate faster than the more conventional edge- 
triggered circuitry. These arguments are based on the fact that in level-clocked circuitry 
computations are allowed to extend beyond a single clock period during the transparent
phase of the level-clocked latches, in contrast to edge-triggered circuitry, in which computations
along every path must complete within a clock period. Advocates of edge-triggering,
on the other hand, present simplicity and implementation ease as major advantages of edge- 
triggering, since edge-triggered latches directly support the abstraction of a storage element 
that is synchronized by the ticking of a clock. They also refer to the existence of powerful 
design tools as another major incentive for designing circuitry with edge-triggered latches. 

These arguments in support of edge-triggering and level-clocking are either theoretical or 
nonquantifiable. We wanted to make an empirical study of the two clocking methodologies. 
For this purpose, we have used Tim to run experiments that compare edge-triggered imple- 
mentations of synchronous circuitry and corresponding level-clocked implementations that 
employ a two-phase, nonoverlapping clocking scheme. Our empirical comparison focused 
on two specific quantitative measures: speed and number of storage elements. We urge the 
reader not to interpret our results in a narrow quantitative manner, however, since our tool 
may have introduced round-off errors. The reader should rather focus on the qualitative 
conclusions that can be drawn from our comparison. 

Our speedup experiments show that edge-triggered circuitry often operates just as fast 
as two-phase circuitry, despite the theoretical advantage of two-phase clocking, and that the 
speed potential of two-phase clocking is generally obtained only when the combinational 
delay between any two consecutive latches is roughly uniform and close to the maximum 
gate delay. Our experiments also show that two-phase clocking leads to greater speedups
when all gate delays in the circuit are roughly equal. Our experimentation with clock
tuning suggests that asymmetric clocking schemes provide little or no speedup over optimal
symmetric clocking schemes.

With respect to the number of storage elements, however, our experiments demonstrate 
that two-phase clocking can lead to substantial reductions in aggressive edge-triggered de- 
signs that operate at maximum speed. For two of our test circuits, we obtained reductions 
of 38%; for more than one third of our circuits, we obtained reductions of 25%; and for 
more than half of our circuits, reductions exceeded 18%. For low-performance designs that 
operate below their speed potential, however, our experiments show that two-phase clocking 
does not reduce the number of storage elements. 

How can edge-triggering and two-phase clocking be fairly compared on an empirical 
basis? First, a fair experiment should compare competing circuit implementations that 
have the same functionality. It would be meaningless to compare two circuits that compute 
different functions. Second, a fair experiment should compare competing circuit implemen- 
tations based solely on differences due to their storage elements. It would be unfair to 
compare two circuits which differ in their combinational elements, for example, because 
such a comparison would not depend only on the clocking methodology employed. 

We did not have the resources to embark on designing pairs of competing circuits for 
various applications. We settled, therefore, on the strategy of taking edge-triggered circuits 
and using our timing tool Tim to produce equivalent two-phase circuits. We could have done 
the reverse, converting two-phase circuits to edge-triggered ones, but since edge-triggered 
designs are more popular, we were able to obtain several interesting edge-triggered circuits. 
(We hope to also obtain interesting level-clocked designs and remedy this situation before 
the full paper is published.) 

We produced a two-phase circuit from an edge-triggered one by following a two-step 
procedure. The first step of this procedure was to replace each edge-triggered latch by a 
pair of back-to-back level-clocked latches that are clocked by a two-phase, nonoverlapping 
clocking scheme, as shown in Figure 4-1. (In fact, it is common in VLSI to implement 
edge-triggered latches by a pair of back-to-back level-clocked latches [14, 58].) The two- 
phase circuit that results after this conversion has the same clock period and the same 
number of storage elements as the original edge-triggered circuit, assuming that each edge- 
triggered latch counts as two level-clocked latches. Moreover, the placement of its latches 
is dictated by the original edge-triggered design, and the potential of two-phase clocking 
due to alternate placements of the latches in the circuit is not revealed. Thus, we needed
a method to relocate storage elements and explore the space of possible placements in the
circuit without changing its functionality or its I/O specification.

Figure 4-1: Replacement of an edge-triggered latch by a pair of level-clocked latches. The
edge-triggered latch is clocked by a single clock $\phi$, and the two level-clocked latches are clocked on the
phases $\phi_0$, $\phi_1$ of a two-phase, nonoverlapping clocking scheme.

Figure 4-2: This figure illustrates retiming of a gate with lag 1. In this case, one storage element
is removed from each output wire of the gate, and one storage element is inserted at each input wire
of the gate. The total number of storage elements is reduced by 1. The critical paths in the circuit
may also change as a result of the relocation of the storage elements.

The second step of the procedure was to use the "retiming" transformation to relocate 
the storage elements of the two-phase circuit that resulted from the first step. Retiming 
relocates storage elements in both edge-triggered and level-clocked circuitry without chang- 
ing its functionality [28, 29, 31]. In addition, retiming is a "universal" transformation for 
speeding up circuits, in the sense that any other functionality-preserving transformation 
that did better than retiming would depend on the functionality of the gates in the circuit 
[29]. Figure 4-2 illustrates the retiming operation for a gate in a circuit. Observe that 
retiming can change the clock period as well as the number of storage elements in a circuit. 
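
The effect of a retiming on the storage elements of a circuit can be made concrete with a small Python sketch. It assumes the convention that a wire $e = u \to v$ carries $w(e) + r(v) - r(u)$ storage elements after retiming, which matches the legality constraint $r(u) - r(v) \le w(e)$ used in Section 3.4.4. The function name retime and the dictionary-based circuit representation are hypothetical choices for this illustration.

```python
def retime(latches, lag):
    """Return the storage-element count of every wire after retiming.

    latches maps each wire (u, v) to its count w(e); lag maps each gate to
    its integer lag r, and gates missing from lag are treated as lag 0.
    """
    retimed = {}
    for (u, v), count in latches.items():
        new_count = count + lag.get(v, 0) - lag.get(u, 0)
        if new_count < 0:
            # A negative count means the retiming is not legal.
            raise ValueError(f"wire {u}->{v} would carry {new_count} storage elements")
        retimed[(u, v)] = new_count
    return retimed


# A gate g with one input wire and two output wires, retimed with lag 1:
# one element leaves each output wire and one element enters the input wire.
before = {("a", "g"): 0, ("g", "x"): 1, ("g", "y"): 1}
after = retime(before, {"g": 1})
```

In this example the total count drops from two to one, because the gate has one more output wire than input wires.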

Our experimental procedure compared an optimal edge-triggered implementation that 
we obtained from an original edge-triggered circuit with an optimal two-phase implementa- 
tion of the corresponding two-phase circuit. The use of an optimal edge-triggered implemen- 
tation as a reference point was essential to ensure that we did not penalize edge-triggering 
due to suboptimalities in the original edge-triggered circuit that depended on the placement 
of the storage elements by the circuit designer and were not intrinsic to edge-triggering. 

We performed two kinds of experiments. The first kind of experiments compared edge- 
triggered and two-phase circuits with respect to speed. Our basic experimental approach was 
the following. First, we retimed a given edge-triggered implementation for maximum speed. 
Then, by using retiming in conjunction with tuning of the clocking schemes, we obtained the 
fastest possible implementation of the corresponding two-phase circuit, and we compared 
the speed of the two optimal implementations. The second kind of experiments compared 
edge-triggered and two-phase circuits with respect to their number of storage elements, 
when operating at some specified clock period. We first retimed a given edge-triggered 
implementation without changing its I/O specification, in order to achieve the specified 
clock period with the minimum number of edge-triggered latches. We then retimed the 
corresponding two-phase circuit without changing its I/O specification, in order to achieve 
the same clock period with the minimum number of latches. We compared the number 
of storage elements in the two optimal implementations, under the reasonable assumption 
that each edge-triggered latch counts as two level-clocked latches. 

Timing verification and optimization of synchronous circuitry has been the subject of 
extensive study [3, 6, 17, 20, 21, 23, 28, 31, 32, 34, 51, 52, 54, 56, 57]. The concept of 
replacing each edge-triggered latch by a pair of back-to-back level-clocked latches, and then 
using retiming for speed optimization has been mentioned in [3, 21, 32]. The potential of 
level-clocking for reducing the number of storage elements has been mentioned in [32]. The 
idea of using latches instead of edge-triggered latches has also been used in [55]. Retiming for
speed has been studied in the context of single-phase level-clocked circuits in [52]. Despite
the large amount of work in this area, our contribution is (we believe) the first attempt to
quantify empirically the performance differences of edge-triggering and two-phase clocking.

The remainder of this chapter has three sections. Section 4.2 describes our experimental
methodology and reports our results on the relative speed of the two implementation
approaches. In Section 4.3 we present our experimental results on latch minimization.
Section 4.4 concludes with a discussion of our results and directions for further research.
Appendix A.2 presents an upper bound on the relative speedup that can be achieved by
two-phase clocking over edge-triggering and explains intuitively how two circuit characteristics,
the maximum gate delay $d_{max}$ and the maximum delay-to-register ratio $R$ of the
circuit, determine the speedups that can be achieved by two-phase clocking.

4.2 Speedup Experiments 

In this section, we present our investigation of edge-triggering and two-phase clocking with 
respect to speed. First, we briefly refer to our tools and test circuits. We move on to 
describe and motivate our experimental methodology, and then we discuss our results. 

Our experiments were performed using Tim. Our test circuits were MCNC benchmark 
circuits and AT&T communication circuits, all of which were originally designed with edge- 
triggered latches. The largest among these circuits had 290 gates. 

4.2.1 Experimental Methodology 

In our speedup experiments we employed the following three optimizations: 

OP1 Retiming of edge-triggered circuitry for maximum speed of operation (minimum clock 
period). 

OP2 Retiming of two-phase circuitry for maximum speed of operation with a symmetric 
clocking scheme. 

OP3 Retiming and simultaneous clock tuning of two-phase circuitry for maximum speed of 
operation. 

Using these three optimizations, we initially performed experiments SP1 and SP2. 

SP1 We compared the speed of each original edge-triggered circuit that was optimized 
using OP1 with the speed of the corresponding two-phase circuit that was optimized 
using OP2. 

SP2 We compared the speed of each original edge-triggered circuit that was optimized 
using OP1 with the speed of the corresponding two-phase circuit that was optimized 
using OP3. 

The goal of our experimentation was not only to investigate whether two-phase clocking 
could speed-up the particular edge-triggered circuits in our test suite. We also wanted to 
determine specific design characteristics that may lead to faster two-phase circuits. To 
that effect, we performed experiments SP3 and SP4 on altered versions of our original test
circuits that were obtained by modifying them in two ways. 

The first modification changed the number of storage elements in the original circuits by 
pipelining, a transformation that increases the latency of a computation without decreasing 
its throughput. A degree-P pipelining of each circuit was obtained by multiplying the 
original number of latches on each wire of the circuit by the integer P. The purpose of this 
transformation was to investigate which of the two implementation approaches is favored 
more when the number of storage elements is increased. 

The second modification changed the gate delays of the original circuits. For each circuit
$G$, we created four additional circuits $G_i$ for $i = 0.8, 0.6, 0.4, 0.2$. Each $G_i$ was topologically
identical to $G$, but its gate delays $d_i$ were modified. For each circuit $G_i$, each gate delay
$d_i(v)$ was set equal to $d(v)^i$, where $d(v)$ was the original delay assigned to $v$. For each gate
in a circuit $G$, this original delay was the worst-case propagation delay between an input
pin and an output pin of the gate, based on the technology library that was used with the
circuit. Thus, for smaller values of the exponent $i$, the gate delays in the circuits $G_i$ became
increasingly uniform. The objective of this modification was to see how uniformity of gate
delays affects the speed of the two implementations.
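
The two circuit modifications described above are simple to state in code. The following Python sketch is only an illustration; the function names pipeline and flatten_delays and the dictionary representation of a circuit are hypothetical. The first function multiplies the latch count of every wire by the pipelining degree P, and the second raises every gate delay to the exponent i.

```python
def pipeline(latches, degree):
    """Degree-P pipelining: multiply the latch count on every wire by P."""
    return {wire: count * degree for wire, count in latches.items()}


def flatten_delays(delays, exponent):
    """Delay modification: replace each gate delay d(v) by d(v) ** exponent."""
    return {gate: d ** exponent for gate, d in delays.items()}


pipelined = pipeline({("a", "b"): 1, ("b", "a"): 2}, 3)
flattened = flatten_delays({"g": 4.0, "h": 1.0}, 0.5)
```

In the second call the ratio between the largest and the smallest delay drops from 4 to 2, which illustrates why smaller exponents make the delays more uniform.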

Using the three optimizations on the modified circuits we performed experiments SP3 
and SP4. 

SP3 On each circuit $G_i$ for $i = 1.0, 0.8, 0.6, 0.4, 0.2$, we applied the following procedure for
P = 1, 2, . . . , 6. We optimized the edge-triggered circuit using OP1, and we compared 
its speed with its corresponding two-phase circuit that was optimized using OP2. 

SP4 On each circuit $G_i$ for $i = 1.0, 0.8, 0.6, 0.4, 0.2$, we applied the following procedure for
P = 1, 2, . . . , 6. We optimized the edge-triggered circuit using OP1, and we compared 
its speed with its corresponding two-phase circuit that was optimized using OP3. 

Note that for i = 1.0 and P = 1, experiments SP3 and SP4 were identical to experiments 
SP1 and SP2, respectively. 

4.2.2 Experimental Results 

Remarkably, our initial experiments SP1 and SP2 indicated that two-phase clocking was 
no better than edge-triggering for any of our test circuits. The application of the three 
optimizations OP1, OP2, and OP3 on the original circuits, with gate delays assigned by their 
corresponding libraries, showed no speedup by switching to two-phase clocking. Although 
this result was surprising and unexpected, it could not have been a mere coincidence. Our 
subsequent empirical investigation with experiments SP3 and SP4 led us to the conclusion 
that there are two important circuit characteristics that determine the relative speed of the 
two implementation approaches: the maximum gate delay $d_{max}$ and the maximum
delay-to-register ratio $R$, which is defined as the maximum ratio of total delay over total latch
count around the cycles in the edge-triggered circuit.
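
For small circuits, the maximum delay-to-register ratio can be computed directly from its definition by enumerating the simple cycles of the circuit graph. The Python sketch below does this by depth-first search. It is a brute-force illustration and not the method used in Tim; the function name is hypothetical, gate labels are assumed to be mutually comparable (for example integers), delays are assumed to be integers so that exact fractions can be used, and parallel wires between the same pair of gates are not modeled.

```python
from fractions import Fraction


def max_delay_to_register_ratio(delay, latches):
    """Maximum of (total gate delay) / (total latch count) over all simple cycles.

    delay maps each gate to its integer delay; latches maps each wire (u, v)
    to its latch count. Returns None if the graph has no cycle.
    """
    successors = {}
    for u, v in latches:
        successors.setdefault(u, []).append(v)
    best = None

    def extend(start, u, total_delay, total_latches, on_path):
        nonlocal best
        for v in successors.get(u, []):
            count = total_latches + latches[(u, v)]
            if v == start:
                if count == 0:
                    raise ValueError("cycle without latches")
                ratio = Fraction(total_delay, count)
                best = ratio if best is None else max(best, ratio)
            elif v not in on_path and v > start:
                # Only visit gates larger than start, so that each cycle is
                # enumerated once, from its smallest gate.
                on_path.add(v)
                extend(start, v, total_delay + delay[v], count, on_path)
                on_path.remove(v)

    for s in delay:
        extend(s, s, delay[s], 0, {s})
    return best


gate_delay = {0: 3, 1: 5, 2: 2}
wire_latches = {(0, 1): 1, (1, 0): 1, (1, 2): 0, (2, 0): 1}
R = max_delay_to_register_ratio(gate_delay, wire_latches)
```

In this example the cycle through gates 0 and 1 has ratio 8/2, and the cycle through all three gates has ratio 10/2, so the maximum ratio is 5, equal to the maximum gate delay.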

Our experimental results for SP3 are illustrated in the plots of Figure 4-3. Each plot
gives data for an original test circuit $G_{1.0}$ and its four delay-modified versions $G_i$ for $i =
0.8, 0.6, 0.4, 0.2$. For each of the five delay configurations of a test circuit and for degrees of
pipelining up to 6, each plot gives two numbers: the speedup $f_{lc}/f_{et}$ achieved by two-phase
clocking over edge-triggering, and the ratio $d_{max}/R$, where $d_{max}$ is the maximum gate delay
and $R$ is the maximum delay-to-register ratio of the circuit. For each circuit $G_i$, the value
of the ratio $d_{max}/R$ closest to 1 is boldfaced.

As is apparent from the graphs, for almost every delay configuration, the maximum
speedup is achieved when the ratio $d_{max}/R$ is closest to 1. In the five configurations of
mult16a, for example, as the ratio $d_{max}/R$ increases from small values and approaches 1
from below, the speedup steadily increases. When the ratio exceeds 1, the speedup
soon drops down to 1. A similar pattern is revealed for almost all of our test circuits.
This phenomenon can be explained as follows. The maximum delay-to-register ratio $R$ is
a lower bound on the clock period of both the edge-triggered and the level-clocked circuit
[21, 43]. Consequently, the longest combinational delay in the circuits is at least $R$ under any
transformation that does not change the number of latches around the cycles in the circuit.
Retiming distributes the latches, however, so that combinational path delays are roughly
equal across the circuit and close to the critical ratio $R$. When $R$ becomes comparable to
the maximum gate delay $d_{max}$, then the longest combinational delay also tends to approach
$d_{max}$, and then the potential of two-phase clocking becomes apparent. Intuitively, when
$R$ approaches $d_{max}$, level-clocking evens out differences among path delays more effectively
than edge-triggering by letting the computations ripple through the transparent latches.

Let us examine more closely some characteristic graphs in Figure 4-3. Our initially
surprising results from experiments SP1 and SP2 can be explained by looking at the ratios
$d_{max}/R$ of the original circuits, which correspond to $P = 1$ and $i = 1.0$. For every such
circuit, the ratio $d_{max}/R$ is smaller than 0.67. The only exceptions are mult16b, ampseq2,
and ampseq1. mult16b has a ratio greater than 1, and consequently, it is already heavily
pipelined. For higher degrees of pipelining, mult16b does not become any faster, which
leads us to the conclusion that the original design of mult16b takes full advantage of any
existing speed potential. The situation with ampseq2 is similar. The original design
already has no margin for improvement, and for higher degrees of pipelining there are sufficiently
many storage elements for edge-triggering to be as fast as two-phase clocking. The situation
with ampseq1 is somewhat different. The ratio $d_{max}/R$ is close to 1, but there is still room
for improvement, since without any pipelining, that is, for $P = 1$, all versions of ampseq1
become faster by level-clocking.

Another conclusion that we can draw from the plots in Figure 4-3 is that two-phase 
clocking leads to greater speedups when the gate delays are more uniform. For every test 
circuit, peak speedups increase as the exponent i decreases, that is, as the gate delays 
become more uniform. This observation suggests that standard-cell designs in which gate 
delays are roughly equal are likely to benefit from two-phase clocking. 

The data shown in Figure 4-3 are the results of experiment SP3, in which the two-phase 
circuits were clocked by symmetric clocking schemes. We also performed experiment SP4 
that combines retiming with tuning of the clocking schemes. In all cases, however, OP3 
did not provide any speedup greater than 2% over OP2. Thus, our experiments suggest 
that clocking with asymmetrical schemes often does not provide any speed advantage over 
symmetric schemes. 

Figure 4-3: Results of experiment SP3 on the MCNC benchmark and the AT&T circuits. Each 
plot corresponds to a test circuit. The first row of the horizontal axis gives the pipelining degree P. 
Each of the next five rows corresponds to a circuit G i , for i = 0.2, 0.4, 0.6, 0.8, 1.0, and it gives the 
ratio d max /R for each pipelining degree. In each row, the ratio d max /R closest to 1 is boldfaced. The 
vertical axis gives the speedup f lc /f et obtained for a specific i and a specific P. The clock frequency 
f et was obtained by applying OP1. The clock frequency f lc was obtained by applying OP2. For 
almost every test circuit, maximum speedups are achieved when the ratio d max /R is closest to 1, or 
equivalently, when R is closest to d max . Greater peak speedups are achieved as we move from G 1.0 
to G 0.2 , that is, as the gate delays become roughly equal across the entire circuit. The results of 
experiment SP4 have no significant differences from the results in this figure. 

4.3 Latch Count Minimization Experiments 

In this section we present our experimental comparison of edge-triggering and two-phase 
clocking in terms of the number of storage elements required by each implementation ap- 
proach. We first describe our methodology, and then we present and discuss our experi- 
mental results. 



4.3.1 Experimental Methodology 

In our experiments, we employed retiming in order to minimize the number of storage 
elements in the circuits. We retimed both the original edge-triggered circuits and their 
corresponding two-phase circuits in order to achieve a given clock period with the minimum 
number of storage elements. In both cases, the retiming transformation was applied without 
relocating the I/O storage elements of the circuits, and thus the I/O specification remained 
unchanged. 

We compared the two implementations of each circuit by performing experiments LM1 
and LM2. 

LM1 We retimed the original edge-triggered circuit, in order to achieve the minimum period 
possible with the minimum number of edge-triggered latches. Then, we retimed the 
corresponding two-phase circuit in order to achieve the same period with the minimum 
number of level-clocked latches. We compared the number of level-clocked latches in 
the two optimal circuits, where each edge-triggered latch counted as two level-clocked 
latches. 

LM2 We retimed the original edge-triggered circuit, in order to achieve its original clock 
period specification with the minimum number of edge-triggered latches. Then, we 
retimed the corresponding two-phase circuit, in order to achieve the same period with 
the minimum number of level-clocked latches. We compared the number of level- 
clocked latches in the two optimal circuits, where each edge-triggered latch counted 
as two level-clocked latches. 
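
The comparison metric shared by LM1 and LM2 can be sketched in a few lines. This is only an illustration of the counting convention described above (each edge-triggered latch counts as two level-clocked latches); the function name and the example numbers are hypothetical and are not taken from Tim or from the experimental data.

```python
def latch_reduction(edge_triggered_latches: int, level_clocked_latches: int) -> float:
    """Fractional reduction in storage elements when moving to two-phase clocking.

    Assumption (from the LM1/LM2 description): each edge-triggered latch
    counts as two level-clocked latches.
    """
    if edge_triggered_latches <= 0:
        raise ValueError("the edge-triggered circuit must contain at least one latch")
    equivalent = 2 * edge_triggered_latches  # edge-triggered count in level-clocked units
    return (equivalent - level_clocked_latches) / equivalent


# Hypothetical circuit: 8 edge-triggered latches after retiming,
# 10 level-clocked latches in the retimed two-phase version.
example_reduction = latch_reduction(8, 10)
```

A positive value means that the two-phase implementation needs fewer storage elements at the same clock period; zero means no gain.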

The motivation behind these two experiments was to investigate the impact of two-phase 
clocking on the number of storage elements under different conditions of operation. Exper- 
iment LM1 was aimed at the typical situation, where speed is the primary concern, and 
edge-triggered circuits are configured to operate at the maximum of their potential. It is 
often the case, however, that the clock period is dictated by external system considerations 
and cannot be changed easily. To that effect, we also performed experiment LM2, which 
compares the number of storage elements in the two implementations when the clock period 
equals that of the original edge-triggered circuit. 

We performed our experiments using Tim. The latch count minimization algorithms in 
Tim run in polynomial time and take into account maximal sharing of storage elements. 
We ran our tests on MCNC benchmark circuits, AT&T communication circuits and custom 
circuitry designed for MIT's Alewife machine. The largest among these circuits had 340 
gates. 

Figure 4-4: Results of experiment LM1. In this experiment, each edge-triggered circuit and its 
corresponding two-phase circuit operate at the minimum clock period that can be achieved by 
retiming the original edge-triggered circuit. For each circuit, the table gives its operating period, the 
minimum number of level-clocked latches in the edge-triggered implementation after retiming (this 
number equals twice the number of edge-triggered latches in the circuit), the minimum number of 
level-clocked latches in the two-phase, level-clocked implementation after retiming, and the reduction 
in the number of storage elements with respect to edge-triggering. 

Figure 4-5: Results of experiment LM2. In this experiment, each edge-triggered circuit and its 
corresponding two-phase circuit are clocked at the original clock period specification of the edge- 
triggered circuit. For each circuit, the table gives its operating period, the minimum number of 
level-clocked latches in the edge-triggered implementation after retiming (this number equals twice 
the number of edge-triggered latches in the circuit), and the minimum number of level-clocked latches 
in the two-phase implementation after retiming. Note that the level-clocked latch count decreases 
only for ampseq3. 

4.3.2 Experimental Results 

Our experimental results for the two sets of experiments are shown in Figures 4-4 and 4-5. 
There is a striking difference between these two sets of results. When the operating period 
is the minimum clock period that can be achieved by retiming the original edge-triggered 
circuit, then two-phase clocking leads to substantial reductions in the number of storage 
elements. When the operating period is that specified for the original circuit, however, there 
are almost no gains in the number of storage elements when we switch from edge-triggering 
to two-phase clocking. 

In the experimental results of Figure 4-4 the greatest reductions were achieved for two 
controller circuits. The number of level-clocked latches in both s382 and the DRAM con- 
troller DRAM-ctl of the Alewife machine was reduced by 38%. Substantial reductions were 
also achieved for the multiplier circuits mult16a, s344, s349, for the controller circuit s400, 
as well as for the communication circuits ampseq1, ampseq3, and ampseq4. The two circuits, 
s641 and s820, for which the number of storage elements did not decrease under two-phase 
clocking were PLDs. 

Figure 4-5 shows that for all circuits except ampseq3, there was no reduction in the 
number of storage elements when the circuits were operating at the clock period specifica- 
tion of the original circuit. This seemingly negative result can be explained by comparing 
the clock periods of the original and the optimally retimed designs. In most cases, the 
original circuits operate substantially slower than the optimally retimed circuits. Most notably, 
the original mult16a is almost four times slower than its minimum-period retimed 
version. When the original clock period specification is so far from the minimum achievable, 
the placement of the storage elements in the edge-triggered circuit is as flexible as in the 
two-phase implementation, and thus no additional reductions are achieved by two-phase 
clocking. In the optimally retimed edge-triggered circuits, however, the minimum number 
of storage elements increases substantially, as can be verified by comparing the columns 
in Figures 4-4 and 4-5 that give the latch counts in the edge-triggered designs. Two-phase 
clocking can decrease this number without degrading circuit performance. In fact, as is 
evident from the columns in Figures 4-4 and 4-5 that give the latch counts in the 
level-clocked designs, the number of level-clocked latches in more than half of the aggressive 
two-phase implementations is not more than 15% higher than the number of level-clocked 
latches in the low-performance implementations. 



4.4 Conclusion 

In this chapter we presented an empirical comparison of edge-triggering and two-phase 
level-clocking in terms of speed and number of storage elements. Our methodology was 
independent of the functionality of the circuit and compared the two design approaches 
based solely on the effects of the storage elements in each one of them. 

In our speedup experiments, edge-triggering was often as fast as two-phase level-clocking, 
except when the average propagation delay between any two consecutive latches was roughly 
uniform over the entire circuit and equal to the maximum gate delay, in which case the 
potential of two-phase clocking was generally obtained. Our experimental results suggest 
that circuits designed with standard cells of uniform delay benefit more from two-phase 
clocking. Moreover, our experiments indicate that symmetric clocking schemes seem to 
perform as well as tuned clocking schemes. 

In terms of number of storage elements, two-phase level-clocking led to substantial 
reductions when the target clock period was set aggressively to the minimum that could be 
achieved by retiming the original edge-triggered circuit. 

We urge the reader not to interpret our results in a narrow quantitative way, since 
our tool may have introduced round-off errors. There are two qualitative conclusions that 
should be drawn from our timing experiments, though. First, under our assumed timing 
model, one should not expect to achieve automatic speedups simply by switching from 
edge-triggering to level-clocking and then by retiming for maximum speed. Second, when 
the clock period of the designs is determined by the constraints around cycles rather than 
the constraints along paths, level-clocking can automatically achieve greater speedups with 
respect to edge-triggering, because it can accommodate more effectively whatever little slack 
there is for the computations along paths. Edge-triggering does not have this flexibility. 

We believe that a more extensive experimentation is essential to obtain more conclusive 
results regarding the relative merits of the two clocking methodologies. First, it is necessary 
to experiment with a wider variety of circuits. Some recent results that we obtained with 
proprietary circuits from Digital Equipment Corporation are in accord with the results we 
presented in this chapter. Those circuits, however, were no bigger than our benchmark 
circuits. We believe it is important to experiment with larger circuits. It is also important to 
experiment with circuits that were originally designed as two-phase circuits. Another 
interesting question that should be further investigated involves asymmetric clocking schemes. 
What design methodology would really favor the use of asymmetric clocking schemes? Can 
asymmetric clocking schemes decrease the number of storage elements even further than 
symmetric ones? We believe that the best way to answer these questions and address many 
other practical concerns regarding level-clocking is to implement actual circuits using our 
tool and explore the existing possibilities in practice. 



Directions for Further Research 



There are many interesting, important and fruitful research directions in the area of timing 
analysis and optimization of synchronous systems. Some of these directions lead to problems 
that are direct extensions of the work presented in this thesis. Others lead to new and 
unexplored terrain. In this chapter we will discuss some of these more challenging directions. 

A rich and largely unexplored field is the area of algorithms for interactive analysis 
of circuit timing. Tim incorporates some elements of interactive analysis by means of its 
sensitivity analysis functions. Can we offer similar functions for retiming? Given that it 
may not always be possible to retime every part of a circuit, are there efficient algorithms 
that would allow us to identify the parts of a circuit that are most promising to retime? Are 
there efficient schemes that would allow us to break a big design into smaller parts, retime 
each of these parts separately, and then combine them again in a single faster design? 
Algorithms that perform these tasks will have enormous impact on the development of 
high-level interactive tools for the design of large circuits. 

Several important issues need to be resolved before retiming becomes a widely used 
timing optimization technique. We believe that verification and modeling are the two most 
crucial among these issues. Verification is used extensively at every stage of circuit design. 
Retiming changes the circuit architecture, and the retimed circuits must be verified once 
again. The challenging task in this case is to compute for a specific set of circuit inputs, the 
new values that must be stored in the latches at any time during the circuit's operation. It 
is not clear how to perform this computation, when retiming moves latches across logic that 
can generate the same output with different input vectors. Modeling is another important 
issue with retiming. The delay model that is used in the retiming literature assigns a fixed 
worst-case propagation delay to each logic gate. As latches are relocated, however, the loads 
of the logic gates change, and consequently their propagation delays change. These changes 
are not accounted for in the model. Proponents of retiming argue that one could always 
size the latch transistors to maintain the same delay characteristics in the circuit. But 
this solution may not be applicable in designs with standard cells that can only use latches 
from a fixed pool. It is essential, therefore, to investigate retiming using more realistic delay 
models and to identify properties in these models that can lead to efficient, polynomial-time 
algorithms. 

Two topics that have not been explored yet in the context of multiphase circuits are 
power dissipation and clock skew optimization. It is possible to retime edge-triggered cir- 
cuitry in a way that reduces its power dissipation. Since there are more latches available 
in corresponding multiphase circuitry, retiming may be more effective for such circuitry. 
Note that for multiphase circuitry, the argument about increased power dissipation due to 
multiple clock lines is not always an issue, because we can generate a multiphase clocking 
scheme by distributing a single clock and then inverting the clocking waveform locally 
to generate the desired waveforms. Clock skew optimization remains a difficult problem 
even for conventional edge-triggered circuits with a single clock. Naturally, the problem is 
more pronounced in multiphase circuits. At the same time, however, multiphase circuits 
offer more ways to tolerate clock skew than conventional edge-triggered circuits. Since the 
successful implementation of high-performance systems depends heavily on accurate 
timing, we believe it is important to investigate the potential of multiphase clocking in this 
direction. 

Another important issue that remains unresolved in timing is the smooth and efficient 
transition between different levels of abstraction. As circuits increase in size and density, 
it is essential to develop high-level tools that will harness the ensuing complexity. At the 
same time, however, it is essential to have efficient and reliable low-level tools, such as a 
transistor-level timing analyzer and optimizer for level-clocked circuits. When both high- 
level and low-level tools are combined, the designer must be able to move freely across 
the different levels of abstraction. What are good models that simplify problems without 
sacrificing accuracy? How could a designer zoom in and out between the different levels of 
abstraction in order to examine design choices more closely? Where should one draw the 
line between the different abstractions? How could one provide effectively the additional 
computational power required for supporting solutions to these problems? The signifi- 
cance of these questions is not restricted to timing. These questions are fundamental in 
computer-aided design, and their answer requires a concerted effort from researchers who 
are knowledgeable both in computational issues and in circuit design issues. 

A.1 Constraints for Proper Timing 

In this appendix we prove Proposition 13, the fundamental premise of the algorithms in this 
thesis. The proof relies on [20], and a complete understanding of the proof requires famil- 
iarity with the material in that paper. For convenience, however, we give a brief description 
of the notions of "computational expansions," "proper timing," and "A-constraints" that 
are the basis of the proof. 

In this appendix, we adopt the circuit representation of [20]. In that representation, a 
circuit is a graph $G = (V, E)$, where each vertex $v \in V$ represents either a functional element 
or a level-clocked latch, and each functional element and level-clocked latch is represented 
by a distinct element of $V$. Edges in $E$ represent only direct component-to-component 
interconnections and have no weights associated with them. Though each element of $V$ 
continues to have associated with it a propagation delay (equal to zero for latches) and a 
phase (the controlling clock for latches), the functions $d$ and $\chi$ are not explicitly included 
as part of the graph $G$. 

The computational expansion $G_{cx}$ of a circuit $G = (V, E)$ is a circuit that performs the 
same computation as $G$, but in a "combinational" fashion. Construction of $G_{cx}$ essentially 
requires making multiple copies of the components in $G$ and connecting them together in 
such a way that for every change in the output of some component in $G$, there exists a 
distinct copy of the component, in the computational expansion, which computes the new 
value of the output. Those familiar with optimizing compilers may find it helpful to think 
of the computational expansion as an "unrolling" of the cycles, or "loops," in a level-clocked 
circuit. We generally denote by $v_t$ a copy of $v \in V$ that exists due to a change in the output 
of $v$ that is caused by a clock transition occurring at time $t$. The results of [20] are based 
on the observation that there exists a strong correlation between the operation of $G$ and 
the operation of the corresponding computational expansion $G_{cx}$. 

The computational expansion of a two-phase, level-clocked circuit is defined as the circuit 
$G_{cx} = (V_{cx}, E_{cx})$, where 

$$V_{cx} = \{ v_t : v \in V \text{ and either } \phi_{\chi(v)} \text{ rises at time } t, \text{ or } t = -\infty \},$$

$$E_{cx} = \{ u_t \to v_{t'} : u \to v \in E;\ u_t, v_{t'} \in V_{cx};\ t \le t', \text{ and no clock rises during } (t, t') \},$$

and the delays of the components are defined in the natural fashion. The definition presumes 
that there exists a "start-time" $t_0$ before which all clock waveforms maintain a constant 
value, and thus, the circuit is essentially "idle" over the interval $(-\infty, t_0)$. The existence 
of a copy $v_{-\infty}$ for each $v \in V$ reflects the assumption that the circuit is initialized at time 
$-\infty$ into a well-defined "ground" state before it begins to compute. Furthermore, each 
$v_t \in V_{cx}$ has associated with it an up-time of $t$ and a down-time of $t + \phi_{\chi(v)}$. This definition 
of $G_{cx}$ differs from the definition that appears in [20]. The two definitions are equivalent 
for two-phase circuits, however, as we show later in Lemma 56. 

Intuitively, a level-clocked circuit is properly timed if whenever a latch holds a value (i.e., 
whenever its clock input is Low), it holds the same value it would in an identical circuit 
in which all functional elements have zero propagation delay. Ishii and Leiserson [20] show 
that a level-clocked circuit is properly timed if and only if its computational expansion is 
properly timed. Moreover, they show that if, for any latch-to-latch path (possibly itself 
containing several latches) in the computational expansion, the difference between the up- 
time associated with the starting latch and the down-time associated with the ending latch 
is shorter than the propagation delay between the two latches, then the expansion is not 
properly timed. Conversely, they show that if, for all paths between latches in the compu- 
tational expansion, the propagation time does not exceed this "up-to-down" time, then the 
circuit is properly timed (see Theorem 4.1 of [20]). The infinite set of linear inequalities 
that compare up-to-down times with propagation delays is called the set of A-constraints 
for the circuit. Formally, the set of A-constraints for a two-phase, level-clocked circuit $G$ 
can be defined as 

$$A = \{ d(\sigma) \le t'' - t : v_{t'} \text{ has down-time } t'',\ u_t \text{ has up-time } t, \text{ and } \sigma \text{ is a path in } G_{cx} \text{ from } u_t \text{ to } v_{t'} \},$$

where $d(\sigma)$ equals the total propagation delay of all components in the path $\sigma$. 
We can now prove the following lemma. 

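Lemma 55 Let $u_t$ and $v_{t'}$ be latches in the computational expansion $G_{cx}$ of a two-phase, 
level-clocked circuit $G$. Moreover, let $\sigma = v_0, v_1, \ldots, v_k$ be a path from $u_t$ to $v_{t'}$ in $G_{cx}$, 
and let $\sigma' = v'_0, v'_1, \ldots, v'_k$ be a path from $u$ to $v$ in $G$ such that for $i = 0, 1, \ldots, k$, if 
$v_i = w_{t_i} \in G_{cx}$ then $v'_i = w \in V$. Then the following statements hold. 

(i) The up-to-down time $t'' - t$ satisfies 

$$t'' - t = \begin{cases} \pi \left( \dfrac{l(\sigma') - 1}{2} \right) + \phi_{\chi(u)} & \text{if } \chi(u) = \chi(v), \\[1ex] \pi \left( \dfrac{l(\sigma')}{2} \right) - \gamma_{1 - \chi(u)} & \text{if } \chi(u) \ne \chi(v), \end{cases} \tag{A.1}$$

where $l(\sigma')$ denotes the number of latches in the path $\sigma'$. 

(ii) The A-constraint $d(\sigma) \le t'' - t$ holds if and only if 

$$d(\sigma') \le \begin{cases} \pi \left( \dfrac{l(\sigma') - 1}{2} \right) + \phi_{\chi(u)} & \text{if } \chi(u) = \chi(v), \\[1ex] \pi \left( \dfrac{l(\sigma')}{2} \right) - \gamma_{1 - \chi(u)} & \text{if } \chi(u) \ne \chi(v). \end{cases} \tag{A.2}$$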

Proof. First, we show that statement (i) holds by induction on the number of latches $l(\sigma)$. 
Let us consider the basis of the induction for $l(\sigma) = 2$. In this case we have $\chi(v_0) \ne \chi(v_k)$. 
By definition of the fall-time $t''$ we obtain 

$$t'' = t' + \phi_{\chi(v_k)} = t' + \phi_{\chi(v'_k)}, \tag{A.3}$$

since $\chi(v_k) = \chi(v'_k)$. By definition of $G_{cx}$ we have 

$$t' = t + \left( \phi_{\chi(v_0)} + \gamma_{\chi(v_0)} \right) = t + \left( \phi_{\chi(v'_0)} + \gamma_{\chi(v'_0)} \right), \tag{A.4}$$

since $\chi(v_0) = \chi(v'_0)$. Thus, by substitution of Equations (A.3) and (A.4), and the definition 
of a two-phase clocking scheme, we obtain 

$$\begin{aligned}
t'' - t &= \left( t' + \phi_{\chi(v'_k)} \right) - t \\
&= \left( t + \phi_{\chi(v'_0)} + \gamma_{\chi(v'_0)} \right) + \phi_{\chi(v'_k)} - t \\
&= \phi_{\chi(v'_0)} + \gamma_{\chi(v'_0)} + \phi_{\chi(v'_k)} \\
&= \phi_{\chi(v'_0)} + \gamma_{\chi(v'_0)} + \phi_{1 - \chi(v'_0)} \\
&= \pi - \gamma_{1 - \chi(v'_0)},
\end{aligned}$$

and, since $l(\sigma) = l(\sigma')$, Equation (A.1) is satisfied. The base case for $l(\sigma) = 3$ can be shown 
similarly. For the inductive step, we assume that the lemma holds for all paths $\sigma$ such that 
$l(\sigma) < m$, and in a way similar to the base case we can show that Equation (A.1) holds for 
all $\sigma$ such that $l(\sigma) = m$. 

Statement (ii) follows immediately from Equation (A.1) and the fact that $d(\sigma') = d(\sigma)$ 
by construction of $\sigma'$. □ 
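
The closed form of Equation (A.1) can be cross-checked numerically against a direct walk along the latches of a path. The sketch below is only illustrative and rests on two assumptions that mirror Equations (A.3) and (A.4): consecutive latches on the path are controlled by alternating phases, and each latch copy corresponds to the next rising edge of its phase. The function names and the clock parameters are hypothetical.

```python
from fractions import Fraction


def up_to_down_closed_form(num_latches, chi_u, chi_v, phi, gamma):
    """Up-to-down time t'' - t according to Equation (A.1).

    phi[i] and gamma[i] are the high-pulse and gap durations of phase i,
    so the clock period is phi[0] + gamma[0] + phi[1] + gamma[1].
    """
    period = phi[0] + gamma[0] + phi[1] + gamma[1]
    if chi_u == chi_v:
        return period * Fraction(num_latches - 1, 2) + phi[chi_u]
    return period * Fraction(num_latches, 2) - gamma[1 - chi_u]


def up_to_down_by_walking(num_latches, chi_u, phi, gamma):
    """Walk from the first latch (up-time 0) to the last one, phase by phase.

    Returns the down-time of the last latch and its controlling phase.
    Assumes alternating phases along the path, as in a two-phase circuit.
    """
    up_time, chi = 0, chi_u
    for _ in range(num_latches - 1):
        up_time += phi[chi] + gamma[chi]  # next rising edge of the other phase
        chi = 1 - chi
    return up_time + phi[chi], chi


def constraint_holds(path_delay, num_latches, chi_u, chi_v, phi, gamma):
    """Condition (A.2): the path delay must not exceed the up-to-down time."""
    return path_delay <= up_to_down_closed_form(num_latches, chi_u, chi_v, phi, gamma)
```

For example, with hypothetical durations `phi = (3, 4)` and `gamma = (1, 2)`, a path with two latches that starts at a phase-0 latch has an up-to-down time of 8, which is the period 10 minus the gap of phase 1.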

Assuming that the definition of $G_{cx}$ specifies a proper computational expansion, the 
proposition now follows immediately. 

Proposition 1 A two-phase, level-clocked circuit is properly timed if and only if for all 
latches A and B in the circuit, the propagation delay d(p) along any path p from A to B is 
no greater than the rise-to-fall time r(p) of the path. 

Proof. Since there is a one-to-one correspondence between circuit components in a 
two-phase, level-clocked circuit and vertices in the graph representation from [20], and $r(p)$ is 
exactly equal to the value denoted by the expression $t'' - t$ in Lemma 55, the proposition 
follows from Lemma 55 and Theorem 4.1 from [20]. □ 

All that remains to be shown is that the definition of $G_{cx}$ is, in fact, equivalent to the 
definition of the computational expansion that appears in [20]. 

Unlike the simplified definition presented above, the definition of $G_{cx}$ from [20] makes 
reference to a base-step function $B$ that maps pairs $(v, k)$, where $v \in V$ and $k = -1, 0, 1, 2, \ldots$, 
to the integers $\{-1, 0, 1, 2, \ldots\}$. The integer argument $k$ and the integer result are indexes 
into the infinite sequence of maximal time intervals over which the clocks of the circuit 
maintain a constant value. For example, in a two-phase, nonoverlapping clocking scheme, 
the index $-1$ corresponds by convention to the interval $(-\infty, t_0)$, the index 0 corresponds 
to the first high-pulse of $\phi_0$, the index 1 corresponds to the first gap after the high-pulse of 
$\phi_0$, the index 2 corresponds to the first high-pulse of $\phi_1$, and so forth. Each such maximal 
time interval is called a step. Intuitively, if $B(v, i) = i$, then the clock transition at the 
beginning of the $i$th step directly causes a change in the value output by $v$. If $B(v, i) < i$, 
then the value output by $v$ changes because of a clock transition that occurred at a step 
earlier than $i$. Given a circuit $G = (V, E)$, the computational expansion [20] is defined to 
be $(V_{cx}, E_{cx})$, where 

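$$V_{cx} = \{ v_k : v \in V \text{ and } B(v, k) = k \},$$

$$E_{cx} = \{ u_l \to v_k : u \to v \in E,\ B(u, k) = l, \text{ and } B(v, k) = k \},$$

and 

$$B(v, k) = \begin{cases}
\max_{(u, v) \in E} B(u, k) & \text{if } k \ne -1, \text{ and } v \text{ is a functional element;} \\[0.5ex]
B(v, k - 1) & \text{if } k \ne -1, \text{ and } v \text{ is a latch whose clock is Low during step } k; \\[0.5ex]
B(v, k - 1) & \text{if } k \ne -1,\ u \to v \in E,\ v \text{ is a latch whose clock is High during step } k, \text{ and } B(v, k - 1) > B(u, k); \\[0.5ex]
k & \text{if } k \ne -1,\ u \to v \in E,\ v \text{ is a latch whose clock is High during step } k, \text{ and } B(v, k - 1) < B(u, k); \\[0.5ex]
k & \text{if } k \ne -1, \text{ and } v \text{ is a latch whose clock is High during step } k \text{ and Low during steps } -1 \text{ through } k - 1; \\[0.5ex]
-1 & \text{if } k = -1.
\end{cases}$$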



The definition is somewhat complex, due to the fact that the definition from [20] applies 
to a more general class of circuits. The following lemma shows, however, that for two- 
phase, level-clocked circuits, this definition is equivalent to the definition presented in the 
beginning of this appendix. 

Lemma 56 Let $G = (V, E)$ be a two-phase, level-clocked circuit that employs a clocking 
scheme $\pi = (\phi_0, \gamma_0, \phi_1, \gamma_1)$, and let $G_{cx}$ be its computational expansion with base-step 
function $B$. Then, the following statements hold. 

(i) For every vertex $v \in V$, we have $B(v, k) = k$ if and only if either the input phase of 
$v$ makes a Low-to-High transition at the start of the step denoted by $k$ or the step 
denoted by $k$ is $(-\infty, t_0)$. 

(ii) For every edge $u \to v \in E$, if $B(v, k) = k$, then $0 \le B(v, k) - B(u, k) \le 2$, that is, 
step $B(v, k)$ never precedes step $B(u, k)$, and no clock rises between steps $B(u, k)$ and 
$B(v, k)$. 

Proof. We first show that statement (i) holds. 

($\Rightarrow$) The proof is a straightforward case analysis on the possible values of $k$, and it is 
based on the definition of $B$ and the fact that the two phases in $\pi$ are not overlapping. 

For $k = -1$, the last branch in the definition of $B$ specifies that $B(v, -1) = -1$ for all 
$v \in V$, which by convention denotes the interval $(-\infty, t_0)$. 

For $k > -1$, we consider latches and functional elements separately. If $v$ is a latch, 
$B(v, k) = k$ only if either the fourth or the fifth branch in the definition of $B$ applies. In both 
cases, the phase that controls $v$ must be High. Since $\phi_0$ and $\phi_1$ are nonoverlapping, we 
conclude that the controlling phase is Low during step $k - 1$, and thus, it makes a 
Low-to-High transition at step $k$. If $v$ is a functional element, we can always find a latch $v'$ 
that leads to $v$ along a latch-free path, since there are no latch-free cycles in the circuit. In 
this case, the first branch in the definition of $B$ applies for every functional element on the 
path, and we obtain $B(v', k) = B(v, k)$. Thus, the problem reduces to the case $B(v', k) = k$, 
where $v'$ is a latch, and statement (i) holds in the forward direction. 

($\Leftarrow$) The proof of the backward direction is by induction on the number of steps $k$. For 
simplicity, we give the proof only for latches. As in the forward direction, the proof for 
functional elements is a straightforward reduction to the proof for latches. 

The basis of the induction for $k = -1$ follows from the last branch of $B$'s definition. 
The basis for $k \le 3$ follows from the fifth branch of $B$'s definition. 

For the inductive step, consider $k \ge 4$, and let $v$ be a latch whose input phase is $\phi_0$. 
(The case for $\phi_1$ is similar.) According to the definition of the base-step function, the value 
$B(v, k)$ depends on $B(v, k - 1)$ and $B(u, k)$, where $u \to v$ is an edge in $G$. The phase 
$\phi_0$ is Low during steps $k - 1$ through $k - 3$, since $\phi_0$ and $\phi_1$ are not overlapping, and $\phi_0$ 
rises at the start of step $k$. By the second branch in the definition of $B$, it follows that 
$B(v, k - 1) = B(v, k - 2) = B(v, k - 3) = B(v, k - 4)$. By the inductive hypothesis, the 
lemma holds for all integers smaller than $k$, and thus, $B(v, k - 4) = k - 4$, since $\phi_0$ rises at 
the start of step $k - 4$. Moreover, since the input phase of $u$ must be $\phi_1$, we infer using 
similar reasoning that $B(u, k) = B(u, k - 1) = B(u, k - 2) = k - 2$. Since $k - 4 < k - 2$, 
the fourth branch of the definition applies, and it follows that $B(v, k) = k$. Therefore, 
statement (i) also holds in the backward direction. 

Now, we show that statement (ii) holds. 

In order to show that $B(u, k) \le B(v, k)$, it suffices to prove that $B(u, i) \le i$, for every 
vertex $u \in V$ and for every integer $i \ge -1$. This proof is a straightforward induction on $i$ 
that directly applies the definition of $B$. 

In order to show that $B(v, k) \le B(u, k) + 2$, we consider functional elements and latches 
separately. If $v$ is a functional element, then $B(u, k) = B(v, k)$, and the inequality holds. 
If $v$ is a latch, then the clocking input of $u$ is Low, since $B(v, k) = k$ implies that the 
clocking input of $v$ is High, and the two phases are nonoverlapping. If $u$ is a latch, then 
$B(u, k) = B(u, k - 1) = B(u, k - 2)$ from the second branch in the definition of $B$. (If $u$ is a 
functional element, then these equalities still hold by virtue of the first branch in $B$ and the 
fact that there are no latch-free cycles in the circuit.) Statement (i) and the fact that the 
clocking input of $u$ rises at $k - 2$ yield $B(u, k - 2) = k - 2$. Thus, the inequality is satisfied 
with equality in this case, and therefore, statement (ii) holds. □ 
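
The base-step function and both statements of Lemma 56 can be checked mechanically on a small example. The sketch below is an illustration only, not the implementation used in Tim. It assumes a hypothetical four-component ring (latch `a` on phase 0, gate `g`, latch `b` on phase 1, gate `h`), a single input per latch, and the step numbering described above, in which phase 0 is High during steps 0, 4, 8, ... and phase 1 during steps 2, 6, 10, .... The value `k` returned for the fourth and fifth branches follows the way the proof of Lemma 56 uses those branches.

```python
from functools import lru_cache

# Hypothetical ring circuit: a -> g -> b -> h -> a.
LATCH_PHASE = {"a": 0, "b": 1}  # latch name -> controlling phase
PREDECESSORS = {"a": ["h"], "g": ["a"], "b": ["g"], "h": ["b"]}


def clock_is_high(latch, step):
    """Phase 0 is High in steps 0, 4, 8, ...; phase 1 in steps 2, 6, 10, ..."""
    return step >= 0 and step % 4 == 2 * LATCH_PHASE[latch]


@lru_cache(maxsize=None)
def base_step(v, k):
    """Base-step function B(v, k), following the branches of the definition."""
    if k == -1:
        return -1  # last branch: initialization step
    if v not in LATCH_PHASE:
        # first branch: a functional element inherits the latest input step
        return max(base_step(u, k) for u in PREDECESSORS[v])
    if not clock_is_high(v, k):
        return base_step(v, k - 1)  # second branch: opaque latch holds its value
    if not any(clock_is_high(v, j) for j in range(k)):
        return k  # fifth branch: first high pulse of this latch
    (u,) = PREDECESSORS[v]  # assumption: every latch has exactly one input
    held, incoming = base_step(v, k - 1), base_step(u, k)
    # third branch keeps the held step; fourth branch starts a new one
    return held if held > incoming else k
```

On this example, `base_step("a", 4)` evaluates to 4 and `base_step("h", 4)` to 2, so the difference along the edge from `h` to `a` is exactly 2, matching the equality case noted at the end of the proof.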

A.2 Upper Bound on Relative Speedup 

In this section we prove an upper bound on the relative speedup that can be achieved 
by two-phase level-clocking over edge-triggering. This bound is expressed in terms of the 
maximum gate delay $d_{\max}$ and the maximum delay-to-register ratio $R$ around the cycles of 
the edge-triggered circuit. 

The following two theorems give bounds on the clock period that can be achieved by 
retiming and tuning. The first theorem is a restatement of Theorem 3, which provides a 
lower and an upper bound for retiming edge-triggered circuitry. The only difference between 
the two statements is on the left-hand side of the theorem's inequality, which now includes 
the obvious lower bound $d_{\max}$ on the clock period of the circuit. 

Theorem 57 ([45]) Let $G_{et} = (V, E, d, w)$ be an edge-triggered circuit, with delay $d(v)$ for 
each gate $v \in V$ and $w(e)$ latches on each wire $e \in E$. Let 

$$R = \max_{C \in G_{et}} \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w(e)}$$

be the maximum ratio of total delay over total number of edge-triggered latches around 
the cycles $C$ in the circuit $G_{et}$. Moreover, let $d_{\max}$ denote the maximum gate delay in $G_{et}$, 
and let $\Phi_{\min}(G_{et})$ denote the minimum clock period that we can obtain by retiming $G_{et}$. Then 

$$\max\{d_{\max}, R\} \le \Phi_{\min}(G_{et}) < R + d_{\max}. \qquad \Box$$

The second theorem provides a lower bound for retiming and tuning two-phase circuitry. 

Theorem 58 ([21]) Let $G_{lc} = (V,E,d,w)$ be a two-phase circuit, with delay $d(v)$ for each 
gate $v \in V$ and $w(e)$ latches on each wire $e \in E$. Let 

$$S = 2 \cdot \max_{C \in G_{lc}} \frac{\sum_{v \in C} d(v)}{\sum_{e \in C} w(e)}$$

be twice the maximum ratio of delay over number of latches around the cycles $C$ in $G_{lc}$. Then 
the minimum clock period $\Phi_{\min}(G_{lc})$ we can obtain by retiming and tuning $G_{lc}$ satisfies 

$$\max\{d_{\max}, S\} \le \Phi_{\min}(G_{lc}) .$$ □ 

Note that for a given edge-triggered circuit and its corresponding two-phase, level-clocked 
implementation that is obtained by replacing each edge-triggered latch with a pair of level- 
clocked latches, the lower bounds R and S are equal. 
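
As an illustration of the two bounds and of the remark that $R = S$, the following Python sketch computes both quantities for a small made-up circuit. The graph, the gate delays, and the helper names `simple_cycles` and `max_cycle_ratio` are assumptions introduced only for this example; cycles are enumerated by brute force, which is adequate for toy inputs but is not one of the algorithms of this thesis.

```python
from fractions import Fraction


def simple_cycles(edges):
    """Return every simple cycle of a small directed graph as a list of edge indices.

    `edges` is a list of (tail, head, latches) triples. Each cycle is reported
    once, rooted at its smallest vertex. Exponential in general, so this is
    only meant for toy examples.
    """
    out = {}
    for idx, (tail, _head, _w) in enumerate(edges):
        out.setdefault(tail, []).append(idx)

    cycles = []

    def extend(start, node, path, visited):
        for idx in out.get(node, []):
            head = edges[idx][1]
            if head == start:
                cycles.append(path + [idx])
            elif head not in visited and head > start:
                extend(start, head, path + [idx], visited | {head})

    for start in sorted(out):
        extend(start, start, [], {start})
    return cycles


def max_cycle_ratio(delay, edges):
    """Maximum over all cycles of (total gate delay) / (total latch count)."""
    best = None
    for cycle in simple_cycles(edges):
        total_delay = sum(delay[edges[i][0]] for i in cycle)
        total_latches = sum(edges[i][2] for i in cycle)
        if total_latches == 0:
            raise ValueError("latch-free cycle: not a valid synchronous circuit")
        ratio = Fraction(total_delay, total_latches)
        if best is None or ratio > best:
            best = ratio
    return best


# Hypothetical edge-triggered circuit: three gates, four wires.
delay = {"a": 3, "b": 4, "c": 2}
edges_et = [("a", "b", 1), ("b", "c", 0), ("c", "a", 1), ("b", "a", 1)]

# Two-phase version: each edge-triggered latch becomes a pair of level-clocked latches.
edges_lc = [(u, v, 2 * w) for (u, v, w) in edges_et]

d_max = max(delay.values())
R = max_cycle_ratio(delay, edges_et)        # ratio of Theorem 57
S = 2 * max_cycle_ratio(delay, edges_lc)    # ratio of Theorem 58

lower_et, upper_et = max(d_max, R), R + d_max   # bounds on the retimed edge-triggered period
lower_lc = max(d_max, S)                        # lower bound on the two-phase period
print(R, S, lower_et, upper_et, lower_lc)
```

In this example the critical cycle a, b, c has total delay 9 and two edge-triggered latches, so doubling the latch count halves the ratio and the factor of 2 in $S$ restores it.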

We use the bounds in Theorems 57 and 58 to prove the following upper bound on the 
speedup that can be achieved by switching from edge-triggering to level-clocking. 

Lemma 59 Let $G_{et}$ be an edge-triggered circuit, and let $G_{lc}$ be a two-phase circuit that 
is obtained by replacing each edge-triggered latch in $G_{et}$ by a pair of level-clocked latches. 
Let $\Phi_{\min}(G_{et})$ be the minimum clock period that we can achieve by retiming $G_{et}$, and let 
$\Phi_{\min}(G_{lc})$ be the minimum clock period that we can achieve by retiming $G_{lc}$ and 
simultaneously tuning its clocking scheme. Then 

$$\frac{\Phi_{\min}(G_{et})}{\Phi_{\min}(G_{lc})} \le \frac{R + d_{\max}}{\max\{R, d_{\max}\}} .$$
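
The lemma follows by combining the two theorems. The derivation below is a sketch of that step, writing $\Phi_{\min}$ for the minimum clock period and assuming that the upper bound of Theorem 57 and the lower bound of Theorem 58 are read as non-strict inequalities; the closing observation that the bound never exceeds 2 is elementary algebra rather than a statement taken from the lemma.

```latex
\begin{align*}
\Phi_{\min}(G_{et}) &\le R + d_{\max}
  && \text{(Theorem 57, upper bound)} \\
\Phi_{\min}(G_{lc}) &\ge \max\{d_{\max}, S\} = \max\{d_{\max}, R\}
  && \text{(Theorem 58, with } S = R\text{)} \\
\frac{\Phi_{\min}(G_{et})}{\Phi_{\min}(G_{lc})}
  &\le \frac{R + d_{\max}}{\max\{R, d_{\max}\}}
   = 1 + \frac{\min\{R, d_{\max}\}}{\max\{R, d_{\max}\}} \le 2 .
\end{align*}
```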




The ratio $(R + d_{\max}) / \max\{R, d_{\max}\}$ can be used as a predictor of the speedup that may be 
achieved by two-phase clocking. The simple intuition behind this heuristic is that greater 
values of the upper bound leave more room for improvement. There is no guarantee, 
however, on the actual improvement, since we have no lower bound on the speedup. The upper 
bound is maximized for $R = d_{\max}$, where it equals 2. As we see in the experimental results of 
Section 4.2, it is when $R$ approaches $d_{\max}$ that two-phase clocking becomes faster than edge-triggering. 
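
To make the behavior of the predictor concrete, the following Python sketch evaluates the ratio over a small sweep of values of $R$ with the maximum gate delay held fixed. The function name `speedup_bound` and the sample values are assumptions chosen for illustration; they are not measurements from Section 4.2.

```python
from fractions import Fraction


def speedup_bound(R, d_max):
    """Upper bound (R + d_max) / max{R, d_max} on the two-phase speedup."""
    R, d_max = Fraction(R), Fraction(d_max)
    return (R + d_max) / max(R, d_max)


# Hypothetical sweep: fix d_max and vary the delay-to-register ratio R.
d_max = 4
sweep = {R: speedup_bound(R, d_max) for R in (1, 2, 4, 8, 16)}
best_R = max(sweep, key=sweep.get)  # value of R with the largest bound

for R, bound in sweep.items():
    print(f"R = {R:2d}  bound = {float(bound):.2f}")
```

The bound rises toward 2 as $R$ approaches the maximum gate delay from either side and falls back toward 1 as the two quantities diverge, which matches the heuristic described above.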